Adam, Eve and Population Genetics: A Reply to Dr. Richard Buggs (Part 1)

@RichardBuggs,

Surely @DennisVenema is not the only person who can do genome mathematics.

What test or study results can you offer that would indicate an answer closer to 2 than to 10,000? Certainly on a scale of difference that large, it should be relatively easy to offer some general results from your side of the divide.

Hi Tim, I’m not saying we are at an impasse on this whole issue - just on the point about what Dennis was saying in that particular, but very important, passage of his book.

I would invite you to step in and help us. As far as I can see you are not a biologist, so you can help adjudicate between us about the plain meaning of the passage to readers. Perhaps @TedDavis could also pitch in, as he finds Dennis’s writing to have great clarity, as he has mentioned above. As a historian, he must be used to looking closely at the meaning of texts. I would also welcome the view of @glipsnort as a geneticist, and perhaps @Christy could step in as moderator. Other readers are also welcome to weigh in and give their opinions.

The questions I ask you, when you read the extract from Adam and the Genome quoted below (shown in its context), are these:

  • Does the passage make you think that it is referring to a scientific study in which a few genes were selected and the number of alleles of those genes in present-day human populations was measured?

  • Does the passage make you think that someone has done computer calculations on these genes indicating that the ancestral population size for humans is around 10,000?

  • Does the passage make you think that this is a different method from the PSMC method?

Here is the passage that we are discussing in its context in Adam and the Genome:

...given the importance of this question for many Christians— and the strong insistence of many apologists that the science is completely wrong— it is worth at least sketching out a few of the methods geneticists use that support the conclusion that we descend from a population that has never dipped below about 10,000 individuals. While the story of the beleaguered Tasmanian devil provides a nice way to “see” the sort of thing we would expect if in fact the human race began with just two individuals, scientists have many other methods at their disposal to measure just how large our population has been over time. One simple way is to select a few genes and measure how many alleles of that gene are present in present-day humans. Now that the Human Genome Project has been completed and we have sequenced the DNA of thousands of humans, this sort of study can be done simply using a computer. Taking into account the human mutation rate, and the mathematical probability of new mutations spreading in a population or being lost, these methods indicate an ancestral population size for humans right around that 10,000 figure. In fact, to generate the number of alleles we see in the present day from a starting point of just two individuals, one would have to postulate mutation rates far in excess of what we observe for any animal. Ah, you might say, these studies require an estimate of mutation frequencies from the distant past. What if the mutation frequency once was much higher than it is now? Couldn’t that explain the data we see now and still preserve an original founding couple? Aside from the problems this sort of mutation rate would present to any species, we have other ways of measuring ancestral population sizes that do not depend on mutation frequency. These methods thus provide an independent way to check our results using allele diversity alone. Let’s tackle one of these methods next: estimating ancestral population sizes using something known as “linkage disequilibrium.” [Then, the text describes the LD study and continues]...The results indicate that we come from an ancestral population of about 10,000 individuals— the same result we obtained when using allele diversity alone... [Then a little later the chapter continues] A more recent and sophisticated model that uses a similar approach but also incorporates mutation frequency has recently been published. This paper was significant because the model allows for determining ancestral population sizes over time using the genome of only one individual. [It then describes the PSMC method, saying of it]... Instead of looking at a given pair of loci in many individuals, this method looks at many pairs of loci within one individual....this is in good agreement with previous, less powerful methods,
I look forward to your and other readers' answers to my questions.
1 Like

Time for me to comment. I’ll break this up into pieces, starting with my prior thinking on the subject.

The hypothesis is that there was a bottleneck of size two in the immediate human lineage. For me, the plausibility of the hypothesis (i.e. whether it is one I would think likely enough to be worth investigating) depends critically on the timing of the bottleneck. Here’s how I think about it:

  1. Population bottlenecks distort the allele frequency distribution in the bottlenecked population. A bottleneck of size two massively distorts it. It takes time for that distortion to be erased by genetic drift.
  2. The characteristic timescale (in generations) for genetic drift in diploids is twice the effective population size.
  3. Human populations from Africa show allele frequency distributions that are broadly consistent with a constant population size plus relatively recent expansion.
  4. The relevant effective population size for humans is roughly 10,000. (Probably higher, actually, given current estimates of the mutation rate.)
  5. Human generation time is roughly 25 years. (Again, recent estimates put it higher.)

From these facts, my working assumption is that this kind of bottleneck would be detectable for at least ~500,000 years, and that such a tight bottleneck within the last 250,000 years would leave the kind of evidence that researchers would have seen just by looking at allele frequency data.
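For readers who want the arithmetic behind that ~500,000-year figure spelled out, it is just the drift timescale from points 2, 4, and 5 above (a back-of-the-envelope estimate, sensitive to the assumed effective size and generation time):

$$
T \sim 2 N_e \times t_{\text{gen}} \approx 2 \times 10{,}000 \times 25 \ \text{years} = 500{,}000 \ \text{years}.
$$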

That’s my intuition. Testing that intuition requires some work, which I will attempt to describe after my malaria genetics meeting.

7 Likes

I don’t have a copy of the book in front of me, but I have no reason to think the quoted block of text is not accurately quoted and edited.

So, my answer to each of your three questions is, Yes.

I am indeed a true ignoramus on these issues, having never studied any biology after one year in high school roughly fifty years ago. My science degree was in physics, and my research lab experience in astrophysics, and none of it at all recent. Nearly all of the science in Dennis’ book and in this thread is totally unfamiliar to me. My only point of contact is that I did read Ayala’s 1994 paper a few years ago, and I understood the takeaway message about a group of roughly 10,000 ancestors living much further back than 6,000 years ago. I can’t tell you how he and his team got there, and I can’t tell you what the PSMC method is. Like almost everyone else reading this, I don’t know what I’m talking about when it comes to the science. All I can say is that no one in this discussion appears to have made an argument that violates a fundamental physical law. :slight_smile:

I appreciate you coming into this conversation, Dr Buggs and Dr Schaffner. Dennis already devotes a lot of time to helping people like me understand biology, but I wish more biologists would do that. I realize that the demands of running a research laboratory effectively preclude that type of activity as a regular thing, and one’s peers tend to frown on it. Ditto in the humanities. IMO, however, experts in any field have no right to complain about the dismal state of popular knowledge about their field if they haven’t tried to address it themselves, or at least if they haven’t properly supported those colleagues who do make the effort.

8 Likes

Richard,

As far as I can tell, Dennis makes three claims most relevant to your point. One, that there is a method to estimate minimum ancestral population sizes based on measurements of the number of alleles of various genes present in a population, and that this method indicates a population of approximately 10,000. Two, that an independent method exists that does not rely on estimates of past mutation rates, involving “linkage disequilibrium,” and that it converges on the same ancestral population size of 10,000. Three, that a more recent method has been published that is similar (though not identical), is not independent of mutation rate, and also converges on similar results, namely the PSMC method.

Of these three approaches, Dennis’s support for the first seems to derive mostly from calculations on collected data, presumably done by himself or others. The latter two approaches do seem to be published work to which he could (and I think did) direct you. But I’m unclear whether the published studies for those two methods explicitly state Dennis’s conclusions, or whether he is again drawing primarily on their collected data for support. I’m perhaps at a bit of a handicap here, as I’m relying only on excerpts of his book in this thread. But to your point, I do believe he describes three distinct methods. I’m eager to hear more about the sort of calculations conducted in these methods and how they may or may not support Dennis’s argument. That is what I am looking forward to in his remaining posts on this topic.

First, the data. Here is the allele frequency distribution for the combined African population in 1000 Genomes Project data:


This includes all single-base variant sites from the 22 autosomes that have exactly two alleles in the data, i.e. some people have an A at one site, while others have a T there, and no third variant is found. I have only taken sites with a minor allele frequency greater than 1%. I estimate the effective size of the genome being assayed here to be 2.4 billion base pairs.

The frequency on the x axis is that of the derived allele, i.e. the base that is different from the ancestral state, as inferred from primate relatives. Theoretically, this distribution falls as 1/frequency for an ideal, constant-sized population. New variants arise by mutation and are initially present as a single copy, which means an allele frequency of 1/2N, where N is the population size. (It’s 2N because each person has two copies of the genome.) The frequency will then wander randomly from generation to generation (“genetic drift”), and in some cases will eventually wander to high values.
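For readers who want the “1/frequency” statement as a formula: the textbook expectation for a neutral, constant-size, randomly mating diploid population (a standard result, not something derived from this particular data set) is that the number of variant sites whose derived allele is present in i of the 2N genome copies is

$$
E[\xi_i] = \frac{\theta}{i}, \qquad \theta = 4 N_e \mu L,
$$

where \(\mu\) is the per-base mutation rate and \(L\) the number of base pairs surveyed, so the expected count falls in proportion to 1/i, i.e. as one over the allele frequency.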

The rise near the right edge results primarily from misidentification of the ancestral allele (for ~2% of sites); the true frequency for many of those sites is one minus the frequency shown. The little jiggles along the curve are artifacts from binning, not noise – there is a lot of data here, and very little statistical uncertainty.

Here is the same data (black) compared to the prediction for a population with constant size (red):


The observed distribution follows the prediction very well above 20% frequency, and is higher than predicted at lower frequencies, which is indicative of fairly recent population expansion. I made the predicted curve with a forward simulation (for those who know about genetic simulators), previously published, using an effective population size of 16,384 and a mutation rate (1.6e-8/bp/generation) drawn from David Reich’s recent study (here). I chose the mutation rate to be conservative, since it is the highest recently published estimate that I know of. (The lower the mutation rate, the longer it takes to regenerate the diversity lost through a bottleneck, and the easier it is to detect the bottleneck.) I chose the population size because it’s a power of two (convenient for when I start modeling the bottleneck) and in the right ballpark for humans. The chosen population size and mutation rate happen to give a predicted curve that’s pretty much bang on the empirical data, without any tuning needed.
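For readers curious what a forward simulation of this general kind involves, here is a minimal sketch in Python (using numpy). To be clear, this is not the published simulator used for the plots in this thread; the parameters are deliberately downscaled placeholder values so it runs in seconds, and it only illustrates the two basic ingredients described above: drift by binomial resampling, and new mutations entering as single copies.

```python
# Toy Wright-Fisher forward simulation of an allele frequency spectrum.
# Illustrative only: downscaled population, genome, and run length.

import numpy as np

rng = np.random.default_rng(1)

N = 1_000            # diploid population size (downscaled; the posts use ~16,384)
MU = 1.6e-8          # mutation rate per base pair per generation (value quoted above)
GENOME_BP = 5e6      # base pairs simulated (downscaled from ~2.4e9)
GENERATIONS = 8 * N  # run long enough to get reasonably close to equilibrium

def wright_fisher_sfs(n_dip, mu, genome_bp, generations, rng):
    """Return derived-allele frequencies of sites still segregating at the end."""
    two_n = 2 * n_dip                      # number of genome copies in the population
    counts = np.empty(0, dtype=np.int64)   # derived-allele copy counts, one per variant site

    for _ in range(generations):
        # Genetic drift: binomially resample each variant's copy count.
        freqs = counts / two_n
        counts = rng.binomial(two_n, freqs)

        # Drop sites that were lost or fixed (fixed sites are no longer variants).
        counts = counts[(counts > 0) & (counts < two_n)]

        # New mutations: a Poisson number of new variant sites per generation,
        # each starting as a single copy (frequency 1/2N).
        n_new = rng.poisson(two_n * mu * genome_bp)
        counts = np.concatenate([counts, np.ones(n_new, dtype=np.int64)])

    return counts / two_n

freqs = wright_fisher_sfs(N, MU, GENOME_BP, GENERATIONS, rng)

# Crude look at the 1/frequency shape: count variants in a few frequency bins.
bins = np.array([0.01, 0.05, 0.1, 0.2, 0.4, 0.7, 1.0])
hist, _ = np.histogram(freqs, bins=bins)
for lo, hi, n_sites in zip(bins[:-1], bins[1:], hist):
    print(f"{lo:.2f}-{hi:.2f}: {n_sites} segregating sites")
```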

So... the point of this exercise will be to determine whether a model with a recent bottleneck of size two can reproduce the data distribution. Comparisons in the next post.

5 Likes

Bottleneck simulations:
I modeled the Adam and Eve bottleneck as a constant-sized population, followed by a sudden collapse to 2 individuals, after which the population doubles every generation until it reaches a new, fixed value. I assume 25 years per generation. Here is the resulting frequency distribution for a bottleneck 100,000 years ago, with a final population size of 16,384:

The two dotted lines show the contribution from genetic variation that survived the bottleneck (red) and from mutations occurring after the bottleneck (green). The distribution for pre-bottleneck variation, which would have had the characteristic 1/frequency appearance originally, is nearly flat after going through the bottleneck; that’s what I mean by a massive distortion of the spectrum. That’s also why it’s not really relevant that a lot of heterozygosity makes it through a bottleneck: the bottleneck still has dramatic effects on diversity.
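For readers who want the recovery model written out explicitly, here is a rough sketch in Python of the post-bottleneck size trajectory described above. The function and example numbers are illustrative only, not code from the actual simulator.

```python
def post_bottleneck_sizes(n_final, generations_since_bottleneck):
    """Diploid population size in each generation after the crash to two people.

    Encodes the recovery model described above: the population doubles every
    generation until it reaches n_final, then stays at that size.
    """
    sizes = []
    n = 2
    for _ in range(generations_since_bottleneck):
        sizes.append(n)
        n = min(2 * n, n_final)
    return sizes

# Example: a bottleneck 100,000 years ago at 25 years per generation is
# 100,000 / 25 = 4,000 generations before the present.
traj = post_bottleneck_sizes(n_final=16_384, generations_since_bottleneck=4_000)
print(traj[:15])   # [2, 4, 8, ..., 16384, 16384]: recovery takes only ~13 generations
print(len(traj))   # 4000
```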

The contributions from before and after the bottleneck are effectively independent. This means I can increase the pre-bottleneck contribution by increasing the original population size, by as much as is needed to agree with the data. For future comparisons with data, then, I will scale the ancestral population size as needed to match the data in the 60-70% frequency range.

Here is how this particular model compares to the empirical data we’ve already seen:


It might not look too bad at a glance, but the agreement here is terrible in the region of interest. In places, there are more than three times as many observed variants as predicted. There simply has not been enough time for mutation to generate new variants and for genetic drift to increase their frequencies substantially. I know of no biologically plausible process that would make this model work. A smaller post-bottleneck population has more drift, so the peak gets smeared out more, but it also produces half as many variants. Here’s what the same simulation looks like with a post-bottleneck population of 4,000:

(More comparisons in a bit, after I go back and edit my last post.)

5 Likes

Based on the previous results, Adam and Eve (as a unique pair of ancestors) are simply not credible within the last 100,000 years. Not at all. How about longer ago? My intuition was that anything in the last 250,000 years would be easy to rule out – so that’s what I modeled.

Here is the comparison for a 250,000 year old bottleneck, with the usual population size of ~16,000:

We’re getting a lot closer, but we’re still not there. In places there are around twice as many observed variant sites as predicted, since there hasn’t yet been time to fill out the depleted part of the distribution. Much larger or smaller final populations make things even worse.

Based on allele frequencies, then, 250,000 years seems to be too recent for a two-person bottleneck, even just judging the distributions by eye. For even earlier dates, I would want to use more rigorous statistical tests, which should also be more sensitive. Exactly how far back you can exclude a single couple gets murky and would require a lot of study, which is why I usually give “several hundred thousand years” as the likely excluded region.

Questions are welcome.

6 Likes

Hi Steve. Thanks for doing this. Can you show what 500,000 or 1 million look like, or is that too computationally intensive?

Ann

2 Likes

This question is simply out of curiosity, and is not meant to question the nature of the simulation (I have downloaded a couple of papers).

Are your simulations based solely on data on current population(s), or do you have data that directly addresses past populations as well?

Here’s 500,000 years; I’m running 1 million now. To approximate a constant-sized initial population, I simulate a pop of 10,000 for 100,000 generations to start each simulation, so a million years at 16k pop size isn’t much of a computational burden.

As the bottleneck date moves earlier, I start to worry that more complex demographies really should be explored for a better fit. Something like a population size of 30,000 for 250,000 years after the bottleneck, then a second, modest bottleneck, then another 250,000 years at 30,000 might be a way to generate lots of mutations and still get enough drift to shift them to higher frequencies. But I don’t want to undertake this as a research project.
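For concreteness, one way that kind of scenario could be written down for a simulator is as a simple list of epochs. These are the speculative numbers from the paragraph above, not a fitted model, and the encoding itself is just illustrative.

```python
# Hypothetical piecewise demography, running forward in time from the first
# bottleneck, as (duration_in_generations, diploid_size) epochs.
epochs = [
    (250_000 // 25, 30_000),   # ~250,000 years at ~30,000 after the first bottleneck
    (1, 1_000),                # a second, modest bottleneck (size and duration are guesses)
    (250_000 // 25, 30_000),   # another ~250,000 years at ~30,000
]
total_generations = sum(duration for duration, _ in epochs)
print(total_generations)       # 20001 generations, i.e. roughly 500,000 years
```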

2 Likes

The empirical data I’m trying to match is purely from current population data. The mutation rate is a fixed parameter in the model. It can be estimated from comparison with ancient DNA, or from comparison with another species, but the estimate I’m using is based on data from modern populations. The generation time also comes from data on modern populations.

Thanks for this

Should the legend say 500 kya simulation instead of 100?

Yes, it should. That’s what happens when you cut and paste in your R script.

Classic question: how large would the error bars on that plot be, given the uncertainty in the mutation rate estimate? Don’t bother if it takes real effort to figure out the answer; I’m already amazed that you have time on your hands to run these simulations.

2 Likes

1 million years. By eye, this doesn’t look any different from the constant-sized population comparison to the data that I posted earlier. Add a population expansion and it would fit well.

2 Likes

Not a simple question, in this case. The mutation rate is poorly known, and why it’s poorly known is also not well understood. Estimates range over roughly 1.1 to 2.0 x 10^-8; most recent estimates have been near the low end of that range. To a varying extent, changing the mutation rate can be absorbed by changing the population size – completely for a constant-sized population, less so for genetic drift after the bottleneck.
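To spell out why that absorption is complete only in the constant-size case (a standard relationship, not something specific to these simulations): the equilibrium level of diversity depends on the product of population size and mutation rate, while the pace of drift depends on population size alone:

$$
\theta = 4 N_e \mu \ \ \text{(sets the equilibrium diversity)}, \qquad
T_{\text{drift}} \sim 2 N_e \ \text{generations (independent of } \mu\text{)}.
$$

So, for example, lowering the mutation rate from 2.0 x 10^-8 to 1.1 x 10^-8 can be offset in the constant-size part of the model by raising N_e by a factor of about 1.8, but that larger N_e also changes how quickly drift reshapes the post-bottleneck spectrum, so the two effects no longer cancel there.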

What I have done is run a couple of the simulations using the highest mutation rate within the above range (2.0 x 10^-8). Here is 250,000 years, pop size = 16,000. You can compare it to the plot above for the same age, which was done with mut rate = 1.6 x 10^-8.

1 Like

Thank you so much for doing these analyses, Steve. I was hoping that my Nature Eco Evo blog would stimulate some studies that set out to explicitly test the bottleneck of two hypothesis, and this is certainly a big step in that direction.

As I begin to comment on this, I think I should say, for those reading in who are not in the science world, that Steve Schaffner is right at the top of the field when it comes to human genomics, and was one of the authors of the 1000 Genomes paper (and many other highly cited and very significant papers too). It is a real privilege for all of us who are interested in this issue to have Steve running simulations on the two-person bottleneck hypothesis and taking the time to answer questions on it.

I would also note that the fact that we are discussing these new simulations is in itself very good backing for the point I made in my blog that more research is needed on this issue. It highlights how mistaken it is to declare that we can be as certain that there has not been a two-person bottleneck as we can be that the earth goes around the sun. After all, if I were to question the latter, no one would need to go away and do a simulation to come up with new evidence for it in order to be persuasive.

Steve, I am very interested in your analyses. I had expected allele counts at polymorphic loci to be the biggest argument I would come across against the bottleneck of two hypothesis. I was not expecting an argument from allele frequency spectra. I am delighted to come across this possible way to test the hypothesis that I had not thought of, and that was not mentioned in Dennis’ book chapter.

I am still going to take a bit of convincing that this is a good approach to testing the hypothesis, however. I will explain my reasoning below. I would underline that I know you see what you have done as just a preliminary study and you yourself are well aware of the approximations and simplifications that you have had to make. I will try to explain my points as simply as I can for our readers.

  1. Steve has already highlighted that this approach depends heavily on a correct estimation of mutation rates, and the model presented assumes that these do not vary with time or in different parts of the genome. This may not be the case in reality.

  2. Also, as far as I can see (Steve, do correct me if I am wrong), this approach depends on the assumption of a single panmictic population over the timespan being examined. I think it would be fair to say that there has been substantial population substructure in Africa over that timespan and that it has varied over time. To my mind, this population substructure could well boost the number of alleles at frequencies of 0.05 to 0.2.

Let me just try to explain that in a way that is a bit more accessible to our readers. I am saying that Steve’s model (at least in its current preliminary form) makes the approximation that there has been one single interbreeding population present in Africa throughout history, and that mating is random within that population. However, the actual history is almost certainly very different from this. The population would have been divided into smaller tribal groups which mainly bred within themselves. Within these small populations, some new mutations would have spread to all individuals and reached an allele frequency of 100%. In other tribes these mutations would not have happened at all. Thus if you treated them all as one large population, you would see an allele frequency spectrum that depends on how many individuals you sampled from each tribe (a toy numerical sketch of this effect appears after point 4 below). It is more complicated than this because every so often tribes would meet each other after a long time of separation and interbreed, or one tribe would take over another tribe and subsume it within itself. Such a complex history, over tens or hundreds of thousands of years, would be impossible to reconstruct accurately, but it would distort the allele frequency spectrum away from what we would expect from a single population with random mating. It gets even more complicated if we also start including monogamy or polygamy.

  3. As far as I can see, the model currently also assumes no admixture from outside of Africa. A group of people arriving in Africa from another continent would affect the allele frequency spectrum if they interbred, and if their non-African population had diverged from African populations. Obviously this could not have happened at time periods when there were no humans outside Africa. But the data under analysis is of present-day Africans after centuries of admixture from outside Africa. Steve may be able to account for this with a more complex model that excludes alleles that are common in non-African populations, although it would be hard to be completely sure about the origins of these alleles.

  4. As far as I can see, the model currently assumes no selection. Natural selection will boost the frequency of beneficial alleles (and of alleles linked to an allele being selected for). Especially relevant would be alleles selected in one location and not another, and alleles under balancing selection. Steve would know better than me how to try to incorporate selection into the model, but my guess is that it would be very tricky.
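Picking up the sketch promised under point 2 above: here is a deliberately simple toy calculation, with made-up numbers, showing how pooling samples from isolated groups can create intermediate-frequency variants even when no single group is polymorphic at the site.

```python
# Toy illustration of how pooling structured subpopulations can create
# intermediate-frequency variants. All numbers are hypothetical.

def pooled_frequency(deme_freqs, deme_sample_sizes):
    """Allele frequency in a sample pooled across demes (a weighted average)."""
    total_copies = sum(2 * n for n in deme_sample_sizes)               # diploid samples
    allele_copies = sum(f * 2 * n for f, n in zip(deme_freqs, deme_sample_sizes))
    return allele_copies / total_copies

# An allele fixed (frequency 1.0) in group A but absent (0.0) in groups B and C.
freqs = [1.0, 0.0, 0.0]
samples = [30, 50, 20]    # hypothetical numbers of sampled individuals per group

print(pooled_frequency(freqs, samples))   # 0.3: an intermediate-frequency variant
```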

Finally, could I ask, Steve: how many allelic variants did you assume in the founding couple, and what proportions of those alleles did you place at frequencies of 25% and 50%? Or did you assume that all variants arose through mutation?

2 Likes

@glipsnort. Thanks very much for doing this. When I asked about it being computationally intensive, I had forgotten that you only let the population double up to 16K and then held it fixed.

We have planned to test the effects of varying the various parameters, as you suggest. No more requests.

1 Like