Adam, Eve and Population Genetics: A Reply to Dr. Richard Buggs (Part 1)

Jay313 · February 15, 2018, 10:47pm

This doesn’t make any sense. But before I get to that, here are the definitions of hominid and hominin according to the Australian Museum:

Hominid – the group consisting of all modern and extinct Great Apes (that is, modern humans, chimpanzees, gorillas and orang-utans plus all their immediate ancestors).

Hominin – the group consisting of modern humans, extinct human species and all our immediate ancestors (including members of the genera Homo, Australopithecus, Paranthropus and Ardipithecus).

I don’t think that God specially created Adam & Eve to contribute to a larger population that included chimps, gorillas, and orangutans. haha. Sorry. In any case, the only hominin that wasn’t extinct 10,000 years ago was us, formally known as Homo sapiens and colloquially referred to as human beings. A unique mating pair named Adam and Eve would not be the first H. sapiens by any definition, nor would they be the first humans, unless you want to strip even that fig-leaf of dignity away from these “pre-Adamic people” (or whatever moniker they may go by in your scheme).

Swamidass · February 15, 2018, 10:54pm

It is common for @DennisVenema to use “hominid” in the same way I just did. Feel free to take that up with him. Perhaps I could be more clear…

Also, the time line needed to be dropped. This would obviously be well before 10 kya.

DennisVenema · February 16, 2018, 12:19am

I do use hominid, but to refer to the common ancestral population that includes the lineage leading to chimpanzees (or further back). As far as I know, in Adam and the Genome, I use hominin - species more closely related to us than chimps. So, hominid isn’t a usual term I use. If you go far enough back in my writing, you’ll come to a time where I wasn’t consistent with my usage, but that is quite a ways back.

DennisVenema · February 16, 2018, 12:21am

Just keep in mind that “human” in my mind is shorthand for anatomically modern human. Also keep in mind that species designations are a fallible human attempt to draw lines on a continuum.

Lynn_Munter · February 16, 2018, 2:58am

There are (more or less) anatomically modern human fossils in this age range in Africa (and now Israel, too).

Jay313 · February 16, 2018, 11:32am

I gathered that. The “quote” you referenced was Swamidass’ words, not mine.

DennisVenema · February 16, 2018, 4:38pm

I don’t know why it quoted you - I knew that the words were Josh’s, not yours. Strange.

RichardBuggs · February 16, 2018, 6:58pm

Hi Steve,
Thank you for coming back to this, and for these useful comments, and for the anecdote about the Myers et al paper.

Regarding Terhorst and Song (2015):

I agree, but they also seem to be saying that it is not just a bottleneck that is hard to see through, but also any order-of-magnitude expansion of effective population size. See p7680: “This implies that for populations that have experienced roughly an order-of-magnitude increase in effective population size during their history, accurate estimation of demographic events that occurred before this expansion is difficult using SFS-based methods.” I would imagine that in the recent past the population of Africa has gone through a rapid increase of effective population size of at least an order of magnitude through both population growth and increased mixing among sub-populations. Wouldn’t it be hard to see back beyond this using SFS-based methods? I have to admit I have not mastered the maths in this paper, so I am just having to go on their discussion section.

Swamidass · February 18, 2018, 1:36am

Okay, here are my current thoughts on trans-species variation. I invite a deep dive in the literature to see if anyone can find a key paper I overlooked. Please prove me wrong if you can…

Trans-species variation. The evidence against an ancient bottleneck in trans-species variation is not as strong as I had thought.

As we have seen, there is a limit to how far back the evidence from Human Variation gives us confidence against a single couple bottleneck. Before about 500 kya, it is possible that such a bottleneck, if brief, would go undetected by current population genetics models. The specific number may be adjusted upwards by further analysis, but it’s a good starting point for now.

However, this is not actually the strongest argument put forward against a single couple bottleneck since we diverged from chimpanzees. For that, we have to look more closely at Trans-Species Variation.

Trans-Species Variation

Human Variation and Trans-Species Variation are related but different. To measure human variation, we look at a large number of human sequences. To measure trans-species variation, we look at a large number of both human and non-human sequences, usually chimpanzee. From looking at this data, we might find evidence of alleles that appear in both chimpanzees (for example) and humans.

This figure illustrates what appears to be happening:

https://discourse-cdn-sjc2.com/standard9/uploads/peacefulscience/original/1X/54227cd654c7fc4a6cf0b6fca9f4d0881807b708.png
Trans-species polymorphism in humans and the great apes is generally maintained by balancing selection that modulates the host immune response | Human Genomics | Full Text

The key point is that along each of the colored lines, several lineages are being shared between different species at a single place in the genome. Normally, there would be just one lineage on these time scales, but balancing selection maintains multiple lineages of alleles. By counting the number of allele lineages shared between humans and others, we can put a hard-stop lower bound on a bottleneck going back before humans and chimps diverged. Whatever bottlenecks there were, they had to be big enough to include all the trans-species lineages.

Molecular Clock Not Valid

One tempting argument, which is not quite right, is to just estimate the TMRCA (or TMR4A) of these alleles, the same as we did across the genome, and use this as an estimate of a bottleneck time. This, however, is an error.

Something called “balancing selection” is critical for enabling variation to last long enough to be shared this long between humans and other species, and this usually happens in proteins important for our immune response. So we see trans-species variation in only a few regions of the genome.

However, balancing selection violates the conditions required to accurately date variation in DNA. We cannot use our formula D = R * T here, because, in this case, we do not have a valid way of estimating R over these time frames. While in neutral regions of the genome, the average mutation rate works in our favor, at times we expect balancing selection to be increasing the rate of change in unpredictable and untestable ways. This can happen very rapidly as balancing selection can even select for increased mutation rates within this region.
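As a rough illustration of why a miscalibrated rate matters, here is a minimal sketch using the D = R * T relation above. All numbers are hypothetical, chosen only for illustration: the true age, the neutral rate, and the threefold rate increase are assumptions, not values from any of the cited studies.

```python
def clock_age(divergence, rate):
    """Age T implied by D = R * T (D per bp, R per bp per year)."""
    return divergence / rate

# Hypothetical values, for illustration only.
neutral_rate = 0.5e-9   # assumed neutral rate, mutations / bp / year
true_age = 500_000      # assumed true age of the variation, in years
rate_boost = 3.0        # assumed fold-increase in rate in a region under selection

# Divergence that actually accumulates at the elevated rate.
observed_divergence = neutral_rate * rate_boost * true_age

# Dating that divergence with the neutral rate overstates the age.
naive_age = clock_age(observed_divergence, neutral_rate)
print(f"true age: {true_age:,} years, clock estimate: {naive_age:,.0f} years")
```

Under these assumptions the clock estimate is too old by the same factor as the rate increase, and because that factor is unknown in these regions it cannot simply be corrected for.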

Ayala’s Argument Against a Bottleneck

The argument here has two parts: the first from effective population size estimates, and the second from trans-species variation. I’m not going to engage the argument about effective population size, because it appears to be incorrect. A very tight bottleneck can still be consistent with a high effective population size, and it seems Ayala missed this point. But this just takes us back to the TMR4A work.
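One way to see how a tight bottleneck and a high effective population size can coexist is the textbook approximation that long-term effective size is the harmonic mean of the per-generation sizes. The sketch below assumes that approximation and uses made-up population sizes; it is not a reconstruction of Ayala’s calculation.

```python
def harmonic_mean_ne(sizes):
    """Long-term effective size as the harmonic mean of per-generation sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical histories over 10,000 generations.
steady = [10_000] * 10_000
with_bottleneck = [10_000] * 9_999 + [2]  # one generation of a single couple

ne_steady = harmonic_mean_ne(steady)
ne_bottleneck = harmonic_mean_ne(with_bottleneck)
print(f"steady: {ne_steady:.0f}, with one-generation bottleneck: {ne_bottleneck:.0f}")
```

In this toy history a single generation of two individuals lowers the long-term figure by about a third, so an effective size in the thousands does not by itself exclude a very brief bottleneck.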

This is where trans-species variation becomes important. It gives an independent way of dating alleles. If an allele in humans is closer to non-human alleles, it appears that it existed before those two species diverged, and was maintained by balancing selection to this day.

This study by Francisco Ayala was the first, to my knowledge, to make the case against a bottleneck by studying trans-species variation in HLA alleles. https://www.sciencedirect.com/science/article/pii/S1055790396900135

https://discourse-cdn-sjc2.com/standard9/uploads/peacefulscience/optimized/1X/27c8a98f8d062548e2133c5cbdef85044bcaee7f_1_263x500.jpg

This figure from Ayala shows human alleles with other primate alleles joined by similarity, not phylogenetic analysis that respects nested clades. I’ve highlighted the human alleles in this figure, and drawn red circles around 7 clusters of alleles which appear to be shared between human and other species. Remember, we can only put 4 alleles at each position in the genome of a couple, so this seems (at least on face value) to demonstrate there must have been at least 4 individuals in the tightest bottleneck of our ancestors.
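The counting argument can be written out in a few lines. This is only a sketch of the arithmetic, assuming diploid individuals that are heterozygous for distinct lineages at the locus; the function name is made up for illustration.

```python
import math

ALLELES_PER_INDIVIDUAL = 2  # diploid: two copies of each autosomal locus

def min_bottleneck_individuals(lineages):
    """Fewest individuals that could carry this many distinct allele lineages."""
    return math.ceil(lineages / ALLELES_PER_INDIVIDUAL)

for lineages in (4, 5, 7):
    print(lineages, "lineages need at least",
          min_bottleneck_individuals(lineages), "individuals")
```

Four lineages fit in a single couple, while anything above four needs a third individual, and the seven clusters circled here need at least four.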

Ayala’s summary is:

Figure 4 is a genealogy of the HLA alleles obtained by the UPGMA method, which assumes constant rates of evolution and thus aligns all 19 alleles at the zero-distance point that corresponds to the present. The genealogy suggests that 8 allele lineages were already in existence 15 Myr ago, at the time of the divergence of the orangutan from the lineage of African apes and humans; and that 12 allele lineages were in existence 6 Myr ago, at the time of divergence of humans, chimps, and gorillas.

The difference between his numbers and mine is in how we determine lineages. There is some ambiguity in how we determine the cutoffs. Still, as long as we see more than 4 lineages with trans-species variation, it seems like evidence against a single couple bottleneck. From this, he argues,

There is, however, no evidence supporting the claim that extreme bottlenecks of just a few individuals, such as postulated by some speciation models (Mayr, 1963; Carson, 1968, 1986), have occurred in association with hominid speciation events, or with major morphological changes, at any time over that last several million years.

This is probably correct, in that there is no evidence for a bottleneck that I can see. But he takes this to mean that a bottleneck has not happened: i.e. there is evidence against a bottleneck in the last several million years. That may be incorrect.

Some Technical Asterisks

Generally speaking, this work has been understood in the field to definitely discount any notion of a single couple bottleneck. On face value, that is certainly what it looks like. However, there are some big caveats.

The molecular clock based dates computed in these studies do not appear to be well calibrated.
We do not really know the confidence on any of these clusters, because Ayala did not estimate them using modern Bayesian methods.
He also used a similarity based method to build the trees, rather than a true phylogenetic reconstruction. This is important, because it can produce different clusters.
It does not appear convergent evolution was accounted for in this analysis. Convergent evolution, at this level, can create the appearance of shared history when there is none.
His population simulation used a bottleneck lasting 10 generations (e.g. 10 individuals for 10 generations), which is much longer than the bottlenecks we are considering (e.g. 2, to 10, to 500, to 2500, to 12500).

While these are interesting results, at some point this analysis needs to be done with better methods to really determine how many lineages have persisted over the last 6 million years. Moreover, an effort to correct for convergent evolution is important here too. On the simulation side, a brief bottleneck needs to be considered, rather than just those of 10 generations.
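To get a feel for how much the duration of a bottleneck matters, here is a small sketch. It uses the standard expectation that heterozygosity shrinks by a factor of (1 - 1/(2N)) per generation as a stand-in for lineage survival, which is a simplification and not the simulation Ayala ran. It also assumes the sizes listed above (2, 10, 500, 2500, 12500) are successive generations of a recovering population.

```python
def heterozygosity_retained(sizes):
    """Expected fraction of heterozygosity left after generations of the given diploid sizes."""
    retained = 1.0
    for n in sizes:
        retained *= 1.0 - 1.0 / (2.0 * n)
    return retained

ten_generation = [10] * 10              # 10 individuals for 10 generations
brief = [2, 10, 500, 2500, 12500]       # assumed rapid recovery from a single couple

kept_long = heterozygosity_retained(ten_generation)
kept_brief = heterozygosity_retained(brief)
print(f"10 individuals x 10 generations: {kept_long:.3f}")
print(f"brief bottleneck of two:         {kept_brief:.3f}")
```

By this crude measure the brief bottleneck of two keeps more variation than ten individuals held for ten generations, which is why a simulation of the longer scenario does not settle the shorter one.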

A Finding Not Replicated

Ayala focused his work on HLA-DQB1 (one of the MHC genes), but similar work has shown trans-species variation at other locations in the genome. However, I could not uncover a single other study that shows more than 4 lineages with trans-species variation.

I cannot do a full review here, but we can see the balancing at other genes, with fewer lineages in the end. For example…

Common chimpanzees have greater diversity than humans at two of the three highly polymorphic MHC class I genes - PubMed

This figure is fairly typical of findings…

A human-specific allelic group of the MHC DRB1 gene in primates - PMC

This figure shows a molecular clock based estimate (which does not appear well calibrated) of 7 lineages at 6 mya; however, fewer than four lineages (0 in this case) are shared with chimpanzee. Reviewing several papers, I cannot find replication of Ayala’s findings of more than 4 lineages being shared between humans and other species.

We can see this pattern in this figure too…

A human-specific allelic group of the MHC DRB1 gene in primates - PMC

Here, the bold leaves are human sequences. Notice the difference between this figure and Ayala’s. There are numbers on the edges (which indicate confidence) and we just do not see nearly as many lineages in common. The authors here conclude there is just one lineage in common.

Here is another typical results figure:

Multiple instances of ancient balancing selection shared between humans and chimpanzees - PMC

Each tree is a different region of the genome. Notice, again, that there does not appear to be more than 4 clusters with both human + chimpanzee alleles.

While Ayala is an established scientist, his work was done in 1996, well before modern sequencing efforts and modern Bayesian analysis of phylogenetic trees. While no one has published on DQB1 since he did, it is very surprising that no one else has replicated his result in the last 22 years on another locus. Of course, if someone can find a study that does, please let me know!

The apparent failure to replicate this finding (with (1) much more data, and (2) improved methods) substantially discounts my trust in his findings. We just know much more about how to analyze these sequences, and we have so many more of them. It is not surprising that our understanding might advance.

One Line of Evidence? One Paper?

At the moment, the Ayala paper appears to be the only study which shows more than 4 allele lineages with trans-species variation. His analysis, however, did not estimate confidence nor did it use phylogenetics to determine lineages. In 22 years, I cannot find a paper that replicates his finding. Certainly, trans-species variation has been observed, but not more than 4 lineages, as far as I can tell.

This is not enough evidence by which to make a confident claim against a single generation bottleneck.

The Way Forward

The right way forward, then, is to study trans-species variation with the data we have now, but with better methods than Ayala had. This takes some difficult work, however. I’m not 100% sure if we will give it a try here, but we might. This is also the most likely place a future study might uncover evidence against a single couple bottleneck.

Until that happens, however, I am not sure this is strong evidence against a brief bottleneck. I stand to be corrected, however, if someone can produce a study that shows this. If you find one, please send it to me.

Swamidass · February 18, 2018, 1:41am

Please, I want to know if I am wrong here. If you can find such a study, please let me know. Correct me if you can!

Swamidass · February 18, 2018, 11:11pm

I want to clarify a final point in public (as we hash out @RichardBuggs’s statement)…

Does Not Depend on Common Descent

Our conversation initially did presume common descent, however this result does not require this assumption. This result applies to everyone, not just those in the EC/TE camp.

The only way that common descent was used in this study was to determine the region-specific mutation rate (which comes out to an average of 0.7e-9 mutations / bp / year). The mutation rate, however, was scaled down to 0.5e-9 mutations / bp / year, to be consistent with several independent studies that have directly measured genome-wide mutation rates by many different methods. Read more about this here: Heliocentric Certainty Against a Bottleneck of Two? - #11 by swamidass - Peaceful Science

We also know that the mutation rate estimated using common descent is well correlated across the genome with the experimentally measured one. Even if not, this will not shift the curve, but just increase the variance. A median TMR4A will not be affected by this error.

What We Can Expect From PSMC and MSMC Simulations

We can and should run simulated populations to see how sensitive PSMC and MSMC are to ancient bottlenecks. However, this will mean making some clearly incorrect assumptions about the population in the past. For example, we could use a population of 10,000 to simulate bottlenecks in the past, but the numbers we derive here will not apply to human data, because we know that at times there was a population of more than 10,000 in the past.

Still, my hope is those simulations would give some theoretical and empirical support for using TMR4A (or a closely related measure) to determine the sensitivity of PSMC and MSMC. With that aim, however, I do not anticipate large changes in the methodology. It is possible the 500 kya could move up to 800 kya or so, but not much more. Though as I have been looking at the data and working out the theory, that is not as likely as remaining below 600 kya or so.

We never know until doing the analysis, but my instincts here have been largely correct.

Swamidass · February 19, 2018, 4:43am

A few final thoughts on trans-species variation.

Convergent Evolution or Trans-Species Variation? A deeper look indicates convergent evolution, which violates the assumptions required for genetic clocks and undermines substantially the argument against a bottleneck using this line of evidence.

I wanted to further expand on this deficiency in Ayala’s study.

Convergent Evolution or Trans-Species Variation?

Convergent evolution, rather than shared history, is an alternate explanation of Trans-Species variation. If we see several alleles in both humans and chimps clustered together, there are two possible explanations:

this could be because the allele lineage existed in the common ancestor of the two species,
or it could be because of convergent evolution.

Ayala never considered or tested for this possibility. This is a critically important point, and why his use of similarity in phylogenetic analysis substantially undermines his point. In order to trust the tree, we need to know how many discordant mutations there are in the tree. However, he used similarity (not nested clades) to build the tree. He had no way to know if the data actually made sense as a tree recapitulating common descent or not.

A basic feature of scientific thinking is to test hypotheses. Ayala’s paper did not rule out the hypothesis of convergent evolution.

Testing for Convergent Evolution

The good news is that convergent evolution leaves a telltale sign. We should see a large number of mutations that cannot fit into a tree-like structure if convergent evolution is at play. It turns out that several groups have been studying convergent evolution on a genome-wide scale, and HLA types regularly are outliers in these analyses.

For example, take a look at this study: Parallel or convergent evolution in human population genomic data revealed by genotype networks | BMC Ecology and Evolution | Full Text. It builds “allelic graphs”, which I won’t explain here in detail, except to say that whenever we see a cycle, like this square, we know that it cannot fit into a tree structure:

https://discourse.peacefulscience.org/uploads/peacefulscience/original/1X/db392b3f0f3490279a4468533fe6b6a07a508d1e.jpg
Parallel or convergent evolution in human population genomic data revealed by genotype networks | BMC Ecology and Evolution | Full Text

We expect a few of these in neutral evolution, but not many. If the data fit a tree, we would only see a single path from the top genotype to the bottom one. However, if we see both paths, we know that different alleles are taking different paths, which means that they do not actually share history here. It is a type of homoplasy, and it is a signature of convergent evolution. Notably, this signature is specific to convergent evolution and is not likely caused by Trans-Species variation.

If we see a large number of these squares in the genetic diversity of a particular part of the genome, that is evidence that the similarity we see between sequences is not actually a signature of common descent. Rather, in these cases, another hypothesis is favored: convergent evolution.
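A simple way to look for such squares is the classic four-gamete check: for any two biallelic sites, a tree with one mutation per site can produce at most three of the four combinations 00, 01, 10, 11. The sketch below is that simpler check, not the genotype-network method of the cited paper. It assumes biallelic sites coded as 0/1 and no recombination between them, and the haplotypes are toy data.

```python
from itertools import combinations

def incompatible_site_pairs(haplotypes):
    """Return site pairs (i, j) where all four gametes 00, 01, 10, 11 are present."""
    n_sites = len(haplotypes[0])
    pairs = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:  # a "square": cannot be placed on a tree without a repeat mutation
            pairs.append((i, j))
    return pairs

# Toy haplotypes over biallelic sites (hypothetical data).
tree_like = ["000", "100", "110", "111"]    # each site mutates once along a tree
with_square = ["00", "10", "01", "11"]      # both paths through the square are present

print(incompatible_site_pairs(tree_like))
print(incompatible_site_pairs(with_square))
```

Counting such pairs across a gene gives a rough sense of how far its variation departs from a tree.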

So what do the authors find?

Well, HLA genes have a massive excess of squares, a clear sign of pervasive convergent evolution. Ayala’s gene HLA-DQB1 is not mentioned in the text, but we find it in the supplementary data as one of the genes with clear evidence of convergent evolution.

Another gene, HLA-DRB1, is the most variable HLA gene. It is notable for having over 500 squares in the DNA of only about 1,000 individuals, compared with an expected number of less than 10. That means if we had tried to put the DNA into a tree, we would see at least 500 mutations discordant with a phylogenetic tree. This is just a stunning result, because it means that HLA-DRB1 alleles are just not well described as a tree. The variation we see is evolving and re-evolving over and over again. Amazing.

It also validates my methodological concern about Ayala’s work:

This is not exactly a new result. Back in 2000, a test of Ayala’s hypothesis was done on HLA-DQB1, and it also found strong evidence of convergent evolution. However, the allele graph makes clear how much this affects the data. Perhaps more importantly, this Nature Genetics study from 1998 directly disputes Ayala’s paper, arguing that this is rapid convergent evolution: Recent origin of HLA-DRB1 alleles and implications for human evolution | Nature Genetics.

It is just not an accurate view of the data to present HLA-DQB1 in a tree based on a similarity matrix. We cannot even correctly determine ancestral history among human alleles themselves, let alone between species. The data seems to look more like convergent evolution than standard common descent, i.e. Trans-Species variation.

Remember, Ayala did not even consider convergent evolution. He did not test for it. This seems to be a valid alternative hypothesis, which also seems to better explain the data.

Moreover, it is not really accurate to present trans-species variation as a settled finding of genomic science. At best, it is one competing hypothesis among many. However, it might even be accurate to say that it is the disfavored hypothesis. There are many more papers disputing Ayala’s findings than supporting them. No one should present this as indisputable and settled evidence against a sharp bottleneck.

Perhaps the data will bear out Ayala’s initial hypothesis, but a lot of work needs to be done to demonstrate this to be the case.

What About Common Descent?

Everyone believes these alleles share common descent (at least back to 4 alleles). However, this is a good reminder that genetic data can pick up signatures that erase the nested clade signature we usually see in DNA. Homoplasy is a real feature of the data, and expected even when there is common descent.

This is a great example of how there are rules in biology (e.g. DNA falls into nested clades), but there are exceptions (convergent evolution), that are very important to understanding this data.

Moreover, the next time someone points to mutations that do not fit the tree pattern in species, remember four things.

1. We expect a few discordant mutations, even in neutral evolution. That is not evidence against common descent.
2. Convergent evolution can also produce discordant mutations. Not usually as many as we see in HLA genes, but more than we expect from neutral evolution.
3. We observe homoplasy and convergent evolution in cancer (called recurrent mutations).
4. We observe homoplasy and convergent evolution in human variation (which everyone agrees shares common ancestry).

In case #2, we still expect to see a signature of common descent in most cases. However, it is such a pervasive pattern in HLA-DRB1 that it appears that the signature of common descent is erased, even though we all agree these alleles share common ancestry. And #3 and #4 are direct empirical evidence that convergent evolution is expected at a DNA level (#3) and that homoplasy is observable in DNA everyone agrees shares common ancestry (#4).

Once again, the rule is that most (but not all) DNA fits into nested clades (a tree), but some does not. Neutral evolution produces nearly nested clade data, but positive selection (and balancing selection) can also lead to convergent evolution. Homoplasy (a violation of nested trees) is expected in some genes, even more than we expect from neutral evolution.

The Median TMR4A Estimate Unaffected

It’s important to understand how these findings interact with the TMR4A estimates.

The convergent evolution creates homoplasy that will artificially inflate TMRCA estimates. Because a tree is a bad fit for the data, it will be impossible to find a parsimonious tree. This will inflate the TMRCA values substantially. This reinforces what I’ve said from the beginning: the molecular clock, in these regions, is not well calibrated.

Another indicator of this is that a much larger fraction of mutations in this region are non-synonymous (i.e. not neutral). This is an indicator that positive selection is driving most of the changes at a far more rapid rate than neutral evolution. The end result of this is artificially inflated TMRCA estimates. Remember that D = T * R holds only in regions where dynamics like this are not taking place.

This does not, however, create a problem for our estimate of a bottleneck limit. Remember that we used the median of TMR4A over the whole genome. So, this estimate is not really influenced much by a small portion of the genome in error. The estimate shifts only about 2 kya per 1% of the genome in error. That is the reason we used the median in the first place; it makes the estimate remarkably stable to errors like this.
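The stability of the median is easy to check on synthetic numbers. The sketch below draws made-up per-region values from a log-normal distribution (an assumption for illustration, not the real TMR4A distribution), inflates a random 1% of them tenfold, and compares how far the median and the mean move.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical per-region TMR4A values in kya (synthetic, not real data).
clean = [rng.lognormvariate(6.0, 0.5) for _ in range(10_000)]

# Inflate a random 1% of regions tenfold to mimic badly miscalibrated loci.
bad = set(rng.sample(range(len(clean)), k=len(clean) // 100))
contaminated = [v * 10 if i in bad else v for i, v in enumerate(clean)]

median_shift = statistics.median(contaminated) - statistics.median(clean)
mean_shift = statistics.mean(contaminated) - statistics.mean(clean)
print(f"median shift: {median_shift:.1f} kya, mean shift: {mean_shift:.1f} kya")
```

The exact figures depend on the assumed distribution, but the median moves far less than the mean because only the regions that cross the midpoint affect it.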

Convergent evolution is really the exception to the rule in human variation. It is not accounted for by most phylogenomic methods, but that does not matter in our genome wide analysis. Our final estimate is not strongly influenced by this problem.

Swamidass · February 19, 2018, 5:02am

Just to expand on this, some argue that homoplasy is evidence against common descent.

Convergence is a common characteristic of life. This commonness makes little sense in light of evolutionary theory.
Convergence: Evidence for a Single Creator by Fuz Rana of RTB
http://stag.reasons.org/explore/publications/facts-for-faith/read/facts-for-faith/2000/09/30/convergence-evidence-for-a-single-creator

The pervasive pattern of homoplasy, which is the term evolutionists use for similarities that cannot be explained by any conceivable pattern of common ancestry, undermines the logic of the argument. Common design explains all similarities, both homologies and homoplasies, but evolution cannot explain the pervasive homoplasies.3 The camera eye, evolutionists say, must have evolved independently six times! Labelling such things as due to ‘convergent evolution’ is pure circular reasoning and lacks any explanatory power.
Is evolution true?

Well, we see homoplasy in human variation, which everyone agrees arises from common ancestors (at least down to 4 alleles). In fact, we see much more homoplasy in human HLA alleles than we do between species. That means we expect to see homoplasy in species level variation too.

This is hardly evidence, then, against common descent. It is what we expect from non-neutral evolutionary processes.

For those who have been following this for a while, we also see homoplasy in cancer (https://biologos.org/blogs/guest/cancer-and-evolution), but we refer to it by a different name. In cancer biology we call it “recurrent mutations,” but we could just as easily call them homoplasies. The prediction from common descent is that DNA falls into nested clades, for the most part, but not exactly. We know there are processes that break this pattern.

Most DNA will fall into nested clades, but some will not. The structure of the biological world is nested clades, but not perfect nested clades. This is not just a circular reasoning rescue for evolution, because we can see this arise in both cancer and human variation. So this is just an empirical and theoretical expectation of evolution.

This is also very closely related to some prior conversation with @Cornelius_Hunter about what common descent predicts vs design.

Signal and Noise - #19 by Swamidass

Notice that Remine argues that a common designer means nature should be in perfect nested clades (high CI). However, Fuz Rana argues that convergent evolution (which breaks the nested clade pattern) is evidence of a common designer (low CI). Instead, we find out that evolutionary theory (with common descent) can tell us why some features break the pattern, and others follow it.

The data fits neither Fuz’s nor Remine’s model of a creator. Instead we find that God designed us through a process of common descent. Or at least the evidence looks that way. This is not evidence against design, but it is evidence that God’s design principle was common descent.

Swamidass · February 19, 2018, 11:19am

HLA Introns Appear Ancient (and Recent?)

So looking into this, I did find some additional evidence for very ancient lineages.

Recent origin of HLA-DRB1 alleles and implications for human evolution | Nature Genetics

This paper is interesting and has some valuable information. They show that the coding sequences of HLA-DRB1 likely arose recently, which is consistent with the convergent evolution work I previously explained. They did this by studying the introns, the non-coding regions adjacent to alleles.

The key observation they make is about the introns of HLA-DRB1 allele lineages. The introns within a lineage are closely related, arising about 250 kya using a mutation rate of 1.4e-9 mutations / bp / year. This indicates that they are not ancient. This is the evidence that significantly undermines Ayala’s argument.

However, the difference between introns in different lineages is much greater, suggesting they are much more ancient. This is an important finding, because introns are under much less selective pressure. For this reason, they will be more clock-like. They estimate that there were about 7 lineages when humans and chimps diverged. This is far fewer lineages than one would compute by looking at the exons, but it is enough to put a minimum bottleneck at 4 individuals, if this is correct. Also, the mutation rate here is high, about 3 times higher than the genome-wide rate. So, in this sense, it is a conservative estimate.
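To make the rate comparison concrete, the sketch below holds the observed divergence fixed in D = R * T and asks what the 250 kya within-lineage age would become under the 0.5e-9 genome-wide rate quoted earlier in this thread. It is simple rescaling arithmetic under that formula, not a re-analysis of the paper’s data.

```python
def rescale_age(age, rate_used, rate_new):
    """Hold D fixed in D = R * T and solve for T under a different rate."""
    return age * rate_used / rate_new

intron_rate = 1.4e-9   # mutations / bp / year, the rate used for the introns
genome_rate = 0.5e-9   # mutations / bp / year, the genome-wide rate quoted earlier

fold_difference = intron_rate / genome_rate
rescaled_kya = rescale_age(250, intron_rate, genome_rate)
print(f"rate ratio: {fold_difference:.1f}x")
print(f"250 kya at the intron rate is about {rescaled_kya:.0f} kya at the genome-wide rate")
```

Because a lower rate stretches every age by the same factor, choosing the higher rate makes all lineages look younger, which is the sense in which the estimate is conservative.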

The discussion in the paper is complex, and not really possible to summarize concisely. But for those studying this question, it is a must read.

One Line Of Evidence Against A Bottleneck?

This does count as one new line of evidence, that deserves some consideration. I’m still not sure how much confidence can place in this. Balancing selection can select for increased mutation rates in this region of the genome. There is also a lot of evidence that mutations often come in clusters, not as singletons. The introns will be more neutral than the exons, but I am not sure how much we can trust them as a clock in this case.

For me the most strange feature data is that, under the author’s hypothesis, over 20 million years we almost never see recombination in this region in the introns . I am having a hard time believing that. It seems that a highly variable and high mutation rate in this region (and perhaps hitchhiking from mutational clusters) might be an alternate explanation for why the allelic lineages are this divergent. One way to test the hypothesis of very ancient allelic lineages here is determine if the introns (not the exons) have trans-species variation with non-human primates. That is a clear prediction of this model, which we really expect to see if these lineages are really this ancient.

However, this test of the model does not appear to have been done. It would be really interesting to see what the data might show. If we were to see trans-species variation of the introns, I might be enough to convince me. That, however, is a very difficult analysis to do correctly.

I’m particularly uncertain about how to square the high divergence between allelic lineages with the absence of recombination reported here. If these are such ancient lineages, we should see more recombination than the authors of these studies presume. Their assumption seems to be inconsistent with measured recombination in this region, which appears to be very similar to the rest of the genome (Recombination rates across the HLA complex: use of microsatellites as a rapid screen for recombinant chromosomes | Human Molecular Genetics | Oxford Academic). If these are such ancient alleles, why do we see no recombination? If the authors are missing recombination events too, then the allelic age will be overestimated. There is, though, some evidence that recombination here is selected against.

I’d be curious to hear comment from @glipsnort and @RichardBuggs on this. How do you weight this paper against everything else?

Swamidass · February 19, 2018, 8:10pm

Let me start by conceding that most colleagues will currently take the introns as de facto evidence against a bottleneck. Maybe they are. However, there is this sticky problem of essentially no observable recombination in this model over literally tens of millions of years, even though we can directly observe recombination in this area.

With that, let me lay out the puzzling features of the data, and why an alternative might be a better explanation.

The HLA Intron Puzzle

It all comes down to the age of the introns we observe in each lineage. However, something is strange about the “clock” here.

If the introns of the allele lineages have an ancient common ancestor (e.g. 20 mya), the observed number of recombinations seems much lower than the observed recombination rate would predict.
If the introns of the allele lineages have a recent common ancestor (e.g. less than 500 kya), the observed number of mutations is much higher than the observed mutation rate would predict.
Whatever process we end up with needs to explain why each allele looks young by one measure (diversity within the allele) and very old by another (high divergence between allele lineages).
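To make the recombination side of the puzzle concrete, here is a rough back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration (intron length, a recombination rate of roughly 1 cM/Mb, a 20-year generation time), not a value taken from the papers under discussion, and the function name is hypothetical.

```python
def expected_crossovers(length_bp: float,
                        rate_per_bp_per_gen: float,
                        years: float,
                        years_per_generation: float) -> float:
    """Expected number of crossovers along one lineage of a segment,
    assuming a constant, neutral recombination rate."""
    generations = years / years_per_generation
    return length_bp * rate_per_bp_per_gen * generations


# Illustrative assumptions only, not measurements from the papers.
INTRON_LENGTH_BP = 500        # assumed intron length
RECOMB_RATE = 1e-8            # per bp per generation, roughly 1 cM/Mb
YEARS_PER_GENERATION = 20     # assumed generation time

ancient = expected_crossovers(INTRON_LENGTH_BP, RECOMB_RATE, 20e6, YEARS_PER_GENERATION)
recent = expected_crossovers(INTRON_LENGTH_BP, RECOMB_RATE, 500e3, YEARS_PER_GENERATION)

print(f"20 mya ancestor:  ~{ancient:.2f} expected crossovers per lineage")
print(f"500 kya ancestor: ~{recent:.2f} expected crossovers per lineage")
```

Under these assumptions, an ancient ancestor implies several crossovers per lineage, while a recent ancestor implies well under one. The real numbers depend heavily on the inputs, but the 40-fold gap between the two scenarios does not.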

One thing that is particularly relevant here is that this paper is from 1998, so they are working from genome-wide phylogenomic mutation rates, not directly observed, region-specific mutation rates. We’ve now directly measured this in a region-specific way. So something is affecting these two clocks in opposite ways in a presumably neutral region of the genome. What could be going on to cause that? It seems that by concluding the allelic lineages coalesce anciently, the authors are just pushing the problem to recombination. This does not really solve the puzzle.

Developing Another Hypothesis

So, here is another option that takes into account all that we know about genetics now, including information not available in 1998. I think that this pattern in the data might arise from the interaction of several different processes. I am going to call this the Bystander Mutation to Hitchhiker Model.

First, remember what is going on in a directly adjacent area:

Very strong balancing, divergent, and positive selection in an adjacent region (the antigen binding exon).
Very high amounts of gene conversion in the adjacent region (the antigen binding exon).

Now to explain the extremely low amounts of recombination in the intron, I’ll invoke a good explanation already put forward in the literature.

Strong negative selection against exon-shuffling recombination, because the function of each exon depends on the variant of the exon it is adjacent to, much more than is usually the case.

Then, to explain the very high amount of mutation in the intron, I’ll recall some recently discovered information about mutation distributions.

Mutations often come in clusters, especially as a result of DNA end-joining repair, which occurs in (for example) recombination and gene conversion. Another way of putting this is that gene conversion events cause mutations. Or, mutations are often accompanied by adjacent mutations, much more than we expect by chance. We see a very clear signature of this in genetic variation data, confirmed by several studies, and also have clear molecular mechanisms for this process. http://science.sciencemag.org/content/329/5987/82.full?rss=1
There is much more convergent evolution happening in the HLA exons than we had previously thought. So the rate of change in these exons is even higher than we thought.

[As an aside, this makes things like multiple mutations in a single exon or gene much more likely, thereby increasing the rate of evolution, and increasing the likelihood of crossing fitness barriers.] Now, getting back to a key fundamental of genetic evolution, one that is a “neutral process”:

Hitchhiking is a process that can rapidly fix neutral mutations (like those in the introns) that are linked to selected mutations (like those in the exons). Because recombination is clearly suppressed somehow (likely negative selection) in this region, the hitchhiking effect will be stronger.

Now we are in a position to think about what the interaction between these effects will be. Notably, this is a highly unusual cluster of evolutionary processes working together. It is not likely that this is taking place in most of the genome. It also seems to depend on co-dependency between exon variants, which may not be equally important to all HLA loci.

Finally, the process that produces introns that are very similar within allele lineages, but dissimilar between them, would be the rare occurrence of a tolerated intron recombination, followed by very rapid evolution of that new recombinant intron by bystander mutation hitchhiking, with the new intron also isolated from other lineages by negative selection against cross-allele recombination. It is possible that, once the functional space is filled by the exons associated with the new recombinant intron, the intron evolution rate would slow, and the allelic introns within a lineage might evolve more neutrally by drift and recombination-driven homogenization (alongside continual convergent evolution in the exon).

This might be a place where the complex dynamics in the Lenski experiment (referenced by @RichardBuggs) might be relevant too. Yes this is a complex theory, but this is also complex data, that does not fit into the expectations of neutral evolutionary theory. Remember, we only trust our formula D = T * R if the region is evolving neutrally. That does not seem to be the case here.
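To see how sensitive the clock is to the assumed rate, here is a minimal sketch that simply inverts D = T * R, using the convention as written above. The divergence and baseline rate are hypothetical values picked so the neutral case lands near 20 million years. The 3x multiplier echoes the elevated mutation rate mentioned earlier for this region. None of these numbers are taken from the papers.

```python
def apparent_age(divergence: float, rate: float) -> float:
    """Invert D = T * R to get T = D / R.

    divergence: substitutions per site between lineages (D)
    rate: substitutions per site per year (R)
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return divergence / rate


# Hypothetical inputs for illustration only.
BASELINE_RATE = 1.0e-9    # assumed substitutions per site per year
DIVERGENCE = 0.02         # assumed substitutions per site

t_baseline = apparent_age(DIVERGENCE, BASELINE_RATE)
t_elevated = apparent_age(DIVERGENCE, 3 * BASELINE_RATE)

print(f"baseline rate: ~{t_baseline / 1e6:.1f} million years")
print(f"3x rate:       ~{t_elevated / 1e6:.1f} million years")
```

The inferred age scales inversely with the rate, so any process that inflates the local rate of change (bystander mutations, hitchhiking) makes the lineages look older than they are under a neutral clock.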

From this proposal, we tentatively expect…

There to be a much lower recombination rate in the introns between allelic lineages than expected from a neutral process (because of negative selection), but a more normal recombination rate within allelic lineages. So this will create very different dynamics between and within allelic lineages.
There to be a much higher rate of change in the introns than the population-level mutation rate alone would tell you. The newly mutated exons that are being continually selected for are strongly linked to introns that carry many bystander mutations caused by the same process that produced the exon mutations (e.g. gene conversion).
This process might produce phylogenies as observed, with low intron diversity by drift and recombination within allelic lineages, but high intron divergence between each lineage.

That means the introns are largely neutral with respect to function. Yes, there will be important sequences here, but most of the sequence is not under functional constraint. They are, however, not evolving neutrally, but strongly under the influence of the neighboring exons. Nor are they evolving at a uniform rate, but at a highly variable one.

Supported by Data?

I think that much more care is needed in thinking through the implications of the proposal I am giving here. This, after all, is not a scientific paper, but a forum post. However, it seems to make sense of both the low recombination rate and the discordance with the observed mutation rate in the introns if the age of the alleles is low. The two hypotheses, as far as I can tell, are:

1. The published hypothesis: ancient coalescences of allele lineages, with the pattern caused by extreme negative selection on intron recombination over 20 million years, but neutral evolution of mutations in the introns. THEREFORE, the molecular clock in introns is valid.
2. Bystander-Hitchhiker hypothesis: much more recent coalescences of allele lineages, with the pattern caused by moderate to strong negative selection on intron recombination, but high rates of bystander mutations hitchhiking along with the constant positive selection for new exon mutations. THEREFORE, the molecular clock in introns is invalid at the most ancient scales.

#2 was considered by the authors of the paper, but they did not know everything we know now about mutation clusters and gene conversion. This changes the analysis substantially, by increasing the rate at which hitchhiking occurs by adding a strong correlation between selection and mutation in introns. Moreover, #2 seems to be more believable (at least to me) than extreme negative selection against recombination over 10s of millions of years.

Much more work could be done to work this out, and potentially validate this theory. First, we need to explicitly work on the math here. Second, we should look at other HLA genes, to see if there is a discernible relationship predicted by that math. Third, we should look to see if the pattern in mutations matches that expected from gene conversion mutagenesis (which has some specific biases). Fourth, a careful review of the evidence from direct measures (rather than phylogenetic inference) of mutation and recombination rates in this region is important. This is a truly massive effort to do correctly. I can tell you right now that we are not going to do such a thing on a blog like this.

However, #2 seems like a hypothesis that makes more sense of the data we know right now from the literature. At the very least, it needs to be carefully considered and ruled out before we can have confidence in this as an argument against a bottleneck. As far as I can tell, no paper has done this, though the literature in this area is vast and I might have missed it.

How Could We Date Them?

If this process is real, then the best date would be the age of the allelic lineages (250 kya), not the date of the coalescence of the distinct lineages (20 mya). However, I am not sure we could fully trust this either. A better theory or simulations might justify this somewhat. It would still be very difficult, because it requires having precise knowledge of the balance between several different mechanisms (e.g. negative selection on recombination, and bystander mutations) over long periods of time. And we also do not have a large number of loci to do error checking on. A priori, we also expect the selective pressures to be substantially changing over this period. I’m not sure it’s possible to come to a confident date estimate based on mutation or recombination clocks.

That leaves trans-species variation as the other way of dating them. This, it seems, also is something not yet done in the literature. It is also a very difficult analysis to do correctly, partly because we do not have access to much chimpanzee data in this region (as far as I can tell).

Hypothesis #1 here, however, requires there to be extremely high selection against intron recombination, so these chimp introns should be very closely alignable with human introns. We should see a very strong signal for trans-species variation. And we should see more than 4 lineages too.

If, on the other hand, we see a lot of recombination between the human and chimp alleles in the intron alignment, that is evidence against the very strong negative selection on recombination that the hypothesis requires. This would give some strong support to Hypothesis #2.

Of course, we cannot know without looking at the data (and that is difficult in this case). I’ll venture a guess that the observations predicted by #2 are more likely.

Where Does that Leave Us?

I’m not sure this is strong evidence against a bottleneck in the end. There are some future studies we can imagine that could clarify the matter. However, based on current evidence, I do not think we can be sure the ages are well-calibrated in the HLA intron experiments.

So, it seems, we are back to TMR4A. Once again, the median estimate is validated as a wise early choice, as it automatically ignores outlier regions like this.

It is possible that better analysis of this data could change our minds. Wherever possible, I’ve tried to map out how future studies and evidence could help discriminate competing hypotheses. For those seeking to disprove a bottleneck, I think this is the most likely place that deeper analysis could change our view. For those really wanting to know what the data really tells us, this also is the place I think we might find the most important information.

However, at this point, we are not facing settled science at all, but the bleeding edge of inquiry into genomic science. I’m curious to see how this unfolds over the next decade. Until that is sorted out, I think our conclusion that a single couple bottleneck between 7 mya and ~500 kya is consistent with the data (not disproven by it), is still a reasonable interpretation.

RichardBuggs · February 19, 2018, 9:45pm

Hi Joshua @swamidass thank you for these very interesting analyses of Ayala et al (1994) and Bergstrom et al (1998). As you know, I mentioned the Ayala paper here in my initial response to @DennisVenema 's part 1 response to me. At the time I said that I thought it was the strongest argument available against a bottleneck of two.

I was puzzled as to why Dennis did not refer me to it, and now I wonder if he perhaps anticipated some of the criticisms that you have made of it. He certainly seemed less confident than I did that it was crucial to his case.

You have made a far more convincing case against Ayala’s paper than I could have done.

I think your point that Ayala’s findings have not been replicated since the human genome project is an interesting one, and your attribution of this to methodological limitations in his tree building methods, and failure to consider the alternative hypothesis of convergence sounds convincing to me.

I would like to put out there another possibility in addition to these, and that is the possibility that before the human genome project, researchers on human MHC loci may sometimes have confused alleles and paralogs. I know from my own experience that before we have a full genome sequence for an organism, it can be very hard to analyse regions of the genome that contain families of highly similar genes: it is very easy to confuse paralogs with alleles. I don’t know much about the DRB1 locus in humans and chimpanzees, but if there are paralogous copies of this gene, they might have been very challenging to identify before a highly accurate assembly of human MHC regions was complete. It is possible that this could be another reason why Ayala’s findings have not been replicated since the human genome project. This is just a speculation, and is easily testable. I may well be wrong, but it is probably worth checking.

gbrooks9 · February 21, 2018, 6:46am

@Lynn_Munter

If you read my posting carefully, it is a hypothetical that I am challenging @Swamidass with.

As best as I can tell, you are arguing on my side of the discussion in your posting above.

gbrooks9 · February 21, 2018, 6:56am

@Swamidass

Maybe you need to be more careful with the syntax?

Your statement that “there is ZERO evidence that humans do not begin as a single couple” is a statement embedded in the YEC context. And yet you try to avoid the YEC context by stripping your assertion of any chronological time frame!

All @DennisVenema has to do is demonstrate that the human population absolutely did not originate as a single couple at any time within the last 10,000 years, and he can be impressively certain that the YEC claims are false.

In contrast, @Swamidass, by your not being specific enough in your choice of words (for example, by intentionally excluding a time frame), you introduce confusion into what it is you think you are proving - - or at the very least, what it is you think you are proving to others.

Swamidass · February 21, 2018, 8:26am

You missed this.

Lynn_Munter · February 21, 2018, 3:05pm

I read your posting about 3 times because I couldn’t make sense of it. Having just read it again, I believe you were not reading Swamidass in the sense he meant to convey: that humans could have begun as a single couple (within a larger hominin population) or at least that there is zero evidence that they did not, within the conventionally accepted time range for the origin of humanity.

If it helps, you are hardly alone in this grammatical confusion!