You have been appropriately pushed to present specific numbers, and you have. Thanks. I am going to work that out here for others to follow.
I should point out, however, that I think it is already clear that Zhao 2000 does not provide evidence against (1) a single-couple bottleneck (2) 500,000 years ago, where (3) there was heterozygosity in this couple. The reason why is actually much more straightforward.
In genetics, TMRCA is always the time back to 1 allele. However, in your scenario, there was never just 1 allele, because it started with 4. We need to know the Time to Most Recent 4 Alleles (TMR4A). This is not a standard computed number in genetics. Provisionally, I think that on average TMRCA / 4 = TMR4A. So given Zhao’s TMRCA estimate of 1 mya, we could estimate that the TMR4A is about 250,000 years ago, matching y-MRCA of about 200,000 years ago and mito-MRCA of about 200,000 years ago. That is, of course, well within your 500,000-year cutoff. So I think we already know (back of the envelope) that the data can be explained by your model.
Someone tell me if I’m wrong in my estimate (TMRCA / 4 approximately equals TMR4A).
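For anyone who wants to check the TMRCA / 4 rule of thumb, here is a minimal sketch. It assumes the standard neutral coalescent with a constant population size (that is my assumption for the sketch, not something taken from Zhao), where the expected time spent with j lineages is 2 / (j(j-1)) in coalescent units. The function names are made up for illustration.

```python
from fractions import Fraction


def expected_time_down_to(k, n):
    """Expected coalescent time for n sampled lineages to merge down to k.

    Assumption: standard neutral coalescent, constant population size.
    While j lineages remain, the expected waiting time is 2 / (j * (j - 1))
    in coalescent units. The units cancel in the ratio computed below.
    """
    return sum(Fraction(2, j * (j - 1)) for j in range(k + 1, n + 1))


def tmr4a_over_tmrca(n):
    """Expected time to 4 lineages divided by expected TMRCA, for sample size n > 4."""
    return expected_time_down_to(4, n) / expected_time_down_to(1, n)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, float(tmr4a_over_tmrca(n)))
    # Scale a 1,000,000-year TMRCA (the figure quoted in this post) by the ratio.
    print(float(1_000_000 * tmr4a_over_tmrca(100)))
```

The sum telescopes, so the ratio is (1/4 - 1/n) / (1 - 1/n), which approaches 1/4 as the sample grows. Under these assumptions the rule of thumb holds on average. Population growth after a bottleneck would shift the numbers, so treat this as a sanity check only.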
This is a pretty classic mistake regardless. Genetics is not genealogy. Just because the genetics has not coalesced does not mean we are not down to one couple. In fact, if the couple is heterozygous, we do not expect TMRCA to equal bottleneck time. That is fairly obvious, but the two are conflated time and time again.
However, there is value in working out the math for observers. Of course, if I have made an error (which does happen sometimes) please do point it out. I will fix it.
@RichardBuggs Simple Model for Variant Number in Zhao 2000
For those who want to see the math, some of it is here: https://en.wikipedia.org/wiki/Genetic_drift, but most of what I’m doing here is from memory. I am using simplified equations at times.
First off, if we are talking of a single couple bottleneck about 500,000 years ago, it will take time to get back up to 10,000 individuals, somewhere between 300 and 1,000 generations (depending on growth rates). However, that is just a small effect on your estimate of 20,000 generations, which is actually a bit low anyway. So, let’s just say 20,000 generations.
Next, your estimate of 1.1x10^-9 mutations / bp is reasonable. We can compute about how many mutations we expect in the whole population each generation. 10,000 individuals * 1.1x10^-9 mutations / bp * 10,000 bp --> 0.11 mutations per generation in this region.
We can also compute the mutation rate for this region as a whole as 10,000 bp / region * 1.1x10^-9 mutations / bp --> 1.1x10^-5 mutations / region per individual each generation.
That means about every 9 generations, one person in a population of 10,000 will have a mutation in this region. So this region would be mutated 0.11 mutations/generation * 20,000 generations --> 2,200 times. This matches @RichardBuggs’s number.
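The arithmetic above can be replayed in a few lines. This is just a sketch using the numbers quoted in this post; the variable names are mine, and it keeps the same simplification of counting one genome copy per individual.

```python
# Inputs quoted in the post (one genome copy per individual, as a simplification).
POPULATION_SIZE = 10_000         # individuals
MUTATION_RATE_PER_BP = 1.1e-9    # mutations / bp
REGION_LENGTH_BP = 10_000        # size of the region in bp
GENERATIONS = 20_000             # roughly 500,000 years

# Mutation rate for the whole region, per individual.
region_rate = MUTATION_RATE_PER_BP * REGION_LENGTH_BP

# New mutations in this region across the whole population, per generation.
population_rate = POPULATION_SIZE * region_rate

# Average wait between new mutations somewhere in the population.
generations_between_mutations = 1 / population_rate

# Total mutations arising in this region over the whole period.
total_mutations = population_rate * GENERATIONS

if __name__ == "__main__":
    print("region rate:", region_rate)
    print("population rate per generation:", population_rate)
    print("generations between mutations:", generations_between_mutations)
    print("total mutations:", total_mutations)
```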
How many of these are going to be fixed?
Well, we know that with random drift it will take about 10,000 generations (the number of individuals) for a mutation to be fixed, though the variance around that figure is high. Someone who knows better (@glipsnort?) correct me if I am wrong about that. Regardless, we can expect about 1.1x10^-5 mutations / region * 20,000 generations --> 0.22 mutations to fix during this time in this one region.
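To show where the 0.22 comes from, here is a minimal sketch. It assumes neutral mutations at equilibrium, where a new mutation's chance of fixing is one over the number of gene copies, so the population size cancels out and the fixation rate equals the mutation rate. The function name is made up for illustration.

```python
def expected_fixations(region_mutation_rate, generations, gene_copies):
    """Expected number of neutral mutations that fix in the region.

    Assumptions: neutral mutations, constant population size (equilibrium).
    gene_copies is the number of copies of the region in the population.
    """
    new_mutations_per_generation = gene_copies * region_mutation_rate
    fixation_probability = 1 / gene_copies  # chance a new neutral mutation fixes
    return new_mutations_per_generation * fixation_probability * generations


if __name__ == "__main__":
    # Same answer whether we count 10,000 or 20,000 copies: the size cancels.
    print(expected_fixations(1.1e-5, 20_000, 10_000))
    print(expected_fixations(1.1e-5, 20_000, 20_000))
```

Because the population size cancels, counting individuals or counting both chromosome copies gives the same expected number of fixations.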
To be clear, there are a few major simplifications in that last number. First, it assumes equilibrium, which is not the case. Remember, in the very first ~500 generations, the population is growing, so there are fewer mutations there. At the same time, when the population is growing, it is much easier for a mutation to be fixed.
And if 0.22 mutations are fixed assuming equilibrium conditions, and it takes 10,000 generations to fix a mutation, we should expect to see several mutations “in transit” during this time, and not fully fixed. As @RichardBuggs says, 78 is not a problematic number.
So, provisionally speaking, it does seem possible, maybe even likely, that in this scenario we could observe these 78 SNP variants (perhaps even at these distributions) in the population today from a primordial pair 500,000 years ago. For observers, it is critical to recognize that a couple at this point in history would not be modern Homo sapiens. This also does not explain all the data (if I am right) but just this single region. An effective model would have to explain all the data, not just this area. To be taken seriously, one has to refrain from claiming success in one area is the same as declaring success everywhere. Moreover, this is merely a qualitative analysis of the sequencing data (as I have not actually verified that only a few variants are required). A better analysis would be quantitative. Moreover, this does not take into account the linkage between all the variants.
Once again, this is consistent with the TMR4A number I computed above, so no real surprise if that can be trusted.
Notably, the math showing that a bottleneck of one couple at 500,000 years ago fits this narrow component of this small part of the data is the same math that gives strong evidence that we share common ancestry with the great apes. It is important to keep this in mind too.
What About Linkage and Recombination?
The analysis above ignores linkage, the adjacency in the chromosomes. We can do a similar analysis for recombination. My intuition here (for what it’s worth) is that this is not enough to make a definitive statement about a bottleneck of 2 at 500,000 years ago, but it’s difficult to know for sure from eyeballing. At issue here is the distance between variants (which is not included in the image).
The rate of recombination is approximately 1% per million bp, or 10^-4 recomb / individual in this region of 10,000 bp. Eyeballing it, there are about 20 variable regions, so we would say that the recombination rate between adjacent variants (assuming they are equally spaced) is about 5x10^-6 / individual.
Similar calculations as before follow. Expect there to be 10,000 * 5x10^-6, or 0.05 recombinations per generation between any given pair of adjacent variants. So every 20 generations, some individual somewhere will recombine betwixt a given pair of loci here. About 1 individual per generation will show recombination in at least one of the 19 regions between variant loci.
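The recombination numbers can be checked the same way. This is a sketch using only the figures quoted in this post (including the eyeballed count of 20 variable sites and the equal-spacing assumption); the variable names are mine.

```python
# Inputs quoted in the post.
RECOMBINATION_PER_BP = 0.01 / 1_000_000  # 1% per million bp
REGION_LENGTH_BP = 10_000                # size of the region in bp
VARIABLE_SITES = 20                      # eyeballed, assumed equally spaced
POPULATION_SIZE = 10_000                 # individuals

# Recombination rate across the whole region, per individual.
region_rate = RECOMBINATION_PER_BP * REGION_LENGTH_BP

# Rate between one pair of adjacent variants (the post divides by 20).
per_interval_rate = region_rate / VARIABLE_SITES

# Events per generation in the whole population, for one interval.
per_interval_population = POPULATION_SIZE * per_interval_rate

# Average wait between events in one given interval.
generations_between_events = 1 / per_interval_population

# Events per generation in the population across all 19 intervals.
all_intervals_population = per_interval_population * (VARIABLE_SITES - 1)

if __name__ == "__main__":
    print("region rate:", region_rate)
    print("per-interval rate:", per_interval_rate)
    print("per-interval events per generation:", per_interval_population)
    print("generations between events:", generations_between_events)
    print("events per generation, all intervals:", all_intervals_population)
```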
How many of these are going to fix?
In the region, we expect 10^-4 recomb in whole region / individual * 20,000 generations --> 2 recombinations to have fixed. This is about 10 times more fixed recombinations than the number of fixed SNP variants (2 versus 0.22). That means we should expect more recombinations “in transit” to fixation than SNP variants themselves.
Notice that this is much faster than the mutation rate, but there is a twist. Not every recombination is detectable, because there really needs to be heterozygosity here for it to matter. Honestly (help me @glipsnort?) I’m not sure what the correction factor is here. My guess is that it would be about 50% less than we observe because of an intuition from Hardy-Weinberg, and then an additional 50% less because of other factors, so 25% of that rate. This, however, is a major fudge factor that should probably reduce the estimates further.
A simple (and wrong) estimate can be parameterized by the data itself: if we observe 78 SNP variants, we should also see about 780 recombination variants. With our fudge factor, maybe 200.
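Here is how that rough estimate falls out of the numbers in this post. It is only a sketch of the back-of-the-envelope reasoning, the function name is made up, and the 25% detectability factor is the guessed fudge factor, not a measured value.

```python
def recombination_variant_estimate(observed_snps, recomb_to_snp_ratio, detectable_fraction):
    """Scale the observed SNP count by the recombination/SNP ratio and a fudge factor."""
    naive = observed_snps * recomb_to_snp_ratio
    adjusted = naive * detectable_fraction
    return naive, adjusted


# Fixed recombinations vs. fixed SNPs over 20,000 generations, from the post.
fixed_recombinations = 1e-4 * 20_000
fixed_snps = 1.1e-5 * 20_000
exact_ratio = fixed_recombinations / fixed_snps  # a bit over 9; the post rounds to 10

if __name__ == "__main__":
    print("ratio:", exact_ratio)
    # 78 observed SNPs, rounded ratio of 10, guessed 25% detectability.
    print(recombination_variant_estimate(78, 10, 0.25))
```

With the rounded ratio of 10 this gives 780 before the fudge factor and 195 after it, which is where the "maybe 200" comes from.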
Though, really, no one should trust this analysis as anything more than qualitative (because of a lot of assumptions here). More care (as is done in published work) must be given to the distances between markers and to considering several places in the genome. I’ve only worked this out here to demonstrate how some of the math works, and to show that we actually expect to see a lot of recombination in this 10kbp region in 500,000 years. And I expect this computation to be refined with more information about the data and better formulas, and also recognition of the actual sampling distribution (and past population structure).
Better theoretical analysis of this can be found here: http://www.pnas.org/content/98/24/13757.full For those that want a real treatment, look there, and references within. And ultimately this is what genome-wide LD analysis does, and such analysis does not detect any bottlenecks. TMRCA, ultimately, is not as important as effective population size estimates on the whole genome.
Allele Clustering is Unconvincing This Far Back
This argument, from the beginning, was quite weak, because clustering is notoriously subjective. Any data can be clustered, but how many clusters are in the data? That is much more difficult to determine from “eyeballing” the data as we have done here.
First off, given recombination, we do not even expect this to be four allele clusters from a primordial pair. We would need to look at smaller regions than 10,000 bp (and actually look at them across the genome) to discern 4 alleles in that way. We would need to look at region sizes much smaller to expect 4.
Second off, given drift, even in smaller regions we expect some of the primordial alleles to have drifted away early on. So some regions would trace back to 4 ancestral alleles, some to 3, some to 2, some to 1. The notion that it should be four assumes both (1) no recombination and (2) that this couple had infinite children. In this case (assuming no miracles), we might say they had 10 kids or so (but not 200). That is going to create a distribution over 1 to 4 of the number of ancestral alleles, which might in principle be detectable. How far back? I have no idea, but I imagine if done systematically over the whole genome on small regions, we should see that signature quite a bit farther back than 200,000 years ago.
This should be a good reminder that population genetics is not intuitive. It really does help to work out the math. Eyeballing clusters in data cannot be a substitute for actually modeling the data with simulations and more rigorous treatments…
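As a small illustration of why the number of children matters, here is a sketch of only the very first step: how many of the couple's 4 alleles at one locus get passed to at least one child. It assumes all 4 alleles are distinct and that each child independently inherits one allele from each parent with probability 1/2. It does not model the drift in later generations, which would remove more alleles. The function name is made up for illustration.

```python
from fractions import Fraction


def first_generation_allele_distribution(children):
    """Probability that 2, 3, or 4 founder alleles reach at least one child.

    Assumptions: one locus, 4 distinct founder alleles, no recombination,
    each child independently gets either allele of a parent with chance 1/2.
    Requires children >= 1.
    """
    # A parent passes on both alleles unless every child got the same one.
    p_both = 1 - Fraction(2, 2 ** children)
    p_one = 1 - p_both
    return {
        2: p_one * p_one,        # each parent passed on only one allele
        3: 2 * p_both * p_one,   # exactly one parent passed on both
        4: p_both * p_both,      # both parents passed on both alleles
    }


if __name__ == "__main__":
    for kids in (2, 4, 10):
        dist = first_generation_allele_distribution(kids)
        print(kids, {k: float(v) for k, v in dist.items()})
```

With 10 children, losing an allele in this first step is rare under these assumptions, so most of the spread over 1 to 4 would have to come from drift in the generations that follow while the population is still small.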
Move On From Zhao 2000?
From that, can we now move on from Zhao 2000?
I think there is good reason that this single paper’s data (in isolation) can be explained by a primordial couple that is heterozygous at 500,000 years ago. Once again, @DennisVenema has already admitted this is not the strongest evidence. Moreover, he never even considered a couple 500,000 years ago.
It is not that the 4 allele cluster argument was correct (it was not), but the whole premise that TMRCA in one autosomal location tells you where a single couple bottleneck happens is flawed. It is a category error. Moreover, recombination is happening in this area, and we do expect to see it in this region.
Ultimately, I agree that the totality of the evidence shows no bottleneck, but there is also value in delimiting exactly what specific points of data (like Zhao 2000, which only looks at one 10kbp region) do and do not tell us. This data, in particular, seems to allow for @RichardBuggs’s hypothesis. Unless someone can point out the error in my math, perhaps it is time to grant that point and move on.
Of course @DennisVenema or @glipsnort can correct me if I made an error here (and I may have). If I made an error, it really should be fixed, and I apologize ahead of time.