Human Chimp Genome Similarity

RichardBuggs · April 23, 2018, 8:12pm

Here is my response to Steve Schaffner’s comments, which he kindly posted earlier based on a short extract from my first 2008 article.

The first contribution by @glipsnort is in a post that has had to remain in the previous thread, so I will quote the relevant part here in italics and interleave my responses:

@glipsnort: "Not exactly my area; I was an author of the chimpanzee genome paper, but my contribution was in modeling selection, not in the sequence comparison. On the other hand, I am pretty familiar with the paper, and it’s safe to say that the quoted text is completely wrong. Similar claims have been repeatedly introduced into Wikipedia, where they have had to be weeded out.

I am very grateful to have a response to my old articles from you, Steve.

I was not aware that these figures had been repeatedly introduced to Wikipedia and repeatedly weeded out. I think that the reason this happened was not because I had said anything from any position of authority, but because I (or perhaps others, if others spotted it) pointed something out that people could see for themselves in the figures given in the chimp genome paper. The paper clearly said how long the alignment was that the authors had made between the human and chimpanzee genomes. It was not at all obvious to the scientifically literate reader what the reason was for the non-aligning regions (apart perhaps from the reasons I pointed out in my 2008 articles). I have to admit that I am disappointed that references to this were apparently “weeded out” from Wikipedia rather than corrected with adequate and clear explanations of what was going on with the unaligned regions (please correct me if this was done - if it was, I was not aware of it). If this had been done, this might have dealt with it more effectively.

@glipsnort: Short summary of the actual comparison:
2700 million base pairs (out of a total of roughly 3100 million bp) of the chimpanzee genome was sequenced well enough to be compared. That is the portion we can say something about. Of that 2700 million, 2400 million could be aligned to the human genome. This portion is the basis for the conclusion that 1.23% of sites in shared DNA differ by a single-base substitution, and that another ~1.5% was unique to each genome.

I fully agree with these figures, and believe I cited them correctly in my articles.

@glipsnort: (Based on these numbers the most reasonable single statement of overall similarity is that approximately 97.3% of the human genome is identical to the chimpanzee genome.)

I think that is only a reasonable statement if it is accompanied by the caveat that this is based on only 2.4Gbp of the human genome. It is not good to cite a percentage without giving a sample size. Especially if not to do so is to imply that one’s sample size is the whole population when it is not. This is the heart of what this debate is about. I think one thing we can probably all agree on is that a single figure statement is not adequate alone, and needs to be accompanied by further explanation.

@glipsnort: The remaining 300 million base pairs of chimp DNA was sequenced but was not compared. 240 million bp were left out because they aligned to multiple places in the human genome. Much of this (I don’t know exactly how much) was the result of badly assembled chimp DNA, while some may represent genuine duplications in the human lineage. Another 90 million bp didn’t align to human at all; again, most of this was probably garbage of various kinds – badly assembled chimp DNA, parts of the human genome that hadn’t been assembled, etc."

This tells me what happened to parts of the chimp genome that did not align to the human genome, but does not tell me explicitly what is happening with the parts of the human genome that do not align to the chimp genome, apart from sequences that appeared to have copy number variation in human but not chimp. It is the parts of the human genome that do not align to the chimp genome that my articles were about.

In a later post @glipsnort says:

As I noted above, there was never 24% of the human genome that does not line up with the chimpanzee genome; most of that 24% represents DNA we never assembled in the chimp genome, so there was no comparison to be made.

The alignment of the human and chimpanzee genomes in 2005 only covered approx 76% of the human genome, so there was approx 24% of the human genome that did not align to the chimpanzee genome assembly. I am not quite sure what you are saying here, Steve. Are you saying that sequence that could align to the remaining 24% of the human genome is all actually in the real chimpanzee genome, and was sequenced but not assembled in 2005? If so, what evidence can you cite for this? If that is not what you are saying, please could you spell out your point more clearly?

@glipsnort: Who were the scientists who argued that this was junk DNA?

I did not find these unaligned regions being discussed anywhere in the literature (if they were, and I missed it, I would be very grateful if you could point me to relevant citations). This statement was based on conversations I had with scientists about the unaligned regions. It was not an entirely unreasonable suggestion, as proliferation of a novel retro-element could generate a large amount of unalignable sequence relatively quickly and easily.

In conclusion, Steve, unless I am missing something, or misunderstanding something, you have not explained why, in 2005, I was “completely wrong”.

AMWolfe · April 23, 2018, 8:26pm

I dunno. I think for as far back as I can remember, I have always understood claims about human-chimp similarity to just mean that our DNA is similar. I don’t start projecting that into a diachronic frame of reference, thinking about things passing unchanged or changed or whatever over generations or evolutionary time scales. I just think about similarities, full stop.

And while I do sympathize with Dennis in general in this conversation, I’m really trying my best to give an honest account of how I would have understood human-chimp similarity claims at various times of my life. I think my naive interpretation of such claims has pretty much always been what I just said.

glipsnort · April 24, 2018, 12:57am

I wasn’t suggesting that you were responsible for what was put into Wikipedia (that didn’t even occur to me, actually). Wherever it came from, though, it was wrong.

I don’t remember what was left in place of the erroneous material. Removing error is a higher priority than getting a complete account in, however.

You did not. You said that 94% of the chimpanzee genome had been sequenced, when in reality only 87% had been sequenced (if the reads haven’t been assembled, the sequencing hasn’t been done), and in most of another 10% the sequence was almost certainly wrong.

A sample size of 2.4 billion base pairs is an enormous sample size. Sure, the 2.4 billion could be a biased sample of the 3.1 billion in the entire genome as far as species divergence is concerned, but you’ve given no reason to suspect that it is. Had you simply said that caution was needed, since a good-sized chunk of the genome was still unexamined, and things might be different there, no one would have objected. But that’s not what the debate is about; the debate is about what you did write, which was something quite different.

Here’s what I’m saying, and I’ll try to be as clear as possible. You wrote, “When we do this alignment, we discover that only 2400 million of the human genome’s 3164.7 million ‘letters’ align with the chimpanzee genome.” In that sentence (especially given the context), you are making a statement about the physical human genome; you are framing it as an answer to the question, “How similar is our DNA to chimpanzee DNA?” The numbers you quote, though, are answering a very different question: “What fraction of the chimpanzee genome did we actually sequence successfully?” “We successfully sequenced somewhere between 77% and 87% of the chimpanzee genome” is not an answer to the question, “How similar is our DNA?” Any details you got wrong are minor compared to this fundamental error.

The similarity of human and chimpanzee DNA is a real fact about the world (one that can be quantified in multiple ways). It is not a value that changes with the amount of sequencing we’ve done. Humans and chimpanzees were not 1% identical in their DNA when we had sequenced 1% of the chimp genome, and we did not become 70% identical when we had sequenced 75% of the genome.

RichardBuggs · April 24, 2018, 8:58pm

Hi Steve,

I have had to run a few errands this evening, so don’t have time to respond to all the points you raised in your last post.

I agree with you that a major reason for the large amount of unaligned human sequence in 2005 was imperfections in the chimpanzee genome assembly. We can see this clearly now with hindsight, as the current chimp genome assembly gives a longer alignment to the human.

There should come a point, though, when the chimpanzee genome is well enough assembled for us to get a reliable figure for the total similarity of the human and chimpanzee genomes, in a manner that includes regions of the human genome that genuinely have no orthology to the chimpanzee genome (of which I am sure you would agree there must be some).

Using alignments available at UCSC, I used a Perl script to calculate total genome similarity between the human and chimpanzee genomes, including SNPs, indels and unaligned regions. Here are the results I get for reciprocal best alignments with successively better chimpanzee genome assemblies:

hg38 v panTro3 82.3%
hg38 v panTro4 82.3%
hg38 v panTro5 84.2%
hg38 v panTro6 84.4%

I am very happy to share my script for you to check if you want, or perhaps you would prefer to do your own independent analysis.

It would appear that improved chimpanzee assemblies between versions 3 and 6 of the chimpanzee genome have only yielded small decreases to the non-aligned parts of the human genome.

T_aquaticus · April 24, 2018, 9:49pm

I would say that 99% isn’t even wrong, to quote Wolfgang Pauli. Any percentage of similarity needs some context. Is it 99% within genes? Does it mean that we have 99% of the same homologous genes? Do we count a 5 base indel as a single mutation or a difference of 5 bases?

I have always found it useful to use a sample sequence to illustrate the difficulties in describing sequence comparisons when talking to non-scientists. For example:

chimp----ATATGGCGCGAGATTCTAGATGGGCCC
human----ATATGGTGCGAGXXXCTAGATGGACCC
(using X to signify gap)

So there is a single 3-base indel and 2 substitutions in 27 bases total. So how do we describe this? We could ask how many bases are unchanged in the DNA that we share through common ancestry, and that answer would be 22 out of the 24 bases (i.e. ignore the indel). That is an entirely valid way of describing the comparison as long as you say what you are comparing. We could also ask how many mutations separate us, and the answer to that is 3 mutations in 27 bases, so 24 out of 27. We could also ask how many bases are the same once every differing base is counted, and that would be 22 out of 27.
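
The three ways of counting can be checked mechanically. Here is a minimal Python sketch using the two example sequences exactly as written above, with `X` as the gap symbol; the function name `compare` and the keys it returns are just illustrative choices of mine.

```python
# Sequences copied from the example above; "X" marks a gap position.
chimp = "ATATGGCGCGAGATTCTAGATGGGCCC"
human = "ATATGGTGCGAGXXXCTAGATGGACCC"


def compare(a, b, gap="X"):
    """Return three (identical, denominator) pairs for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    substitutions = 0  # single-base differences outside gaps
    gap_bases = 0      # columns where either sequence has a gap
    indel_events = 0   # runs of consecutive gap columns
    in_gap = False
    for x, y in zip(a, b):
        if x == gap or y == gap:
            gap_bases += 1
            if not in_gap:
                indel_events += 1
            in_gap = True
        else:
            in_gap = False
            if x != y:
                substitutions += 1
    total = len(a)
    shared = total - gap_bases
    return {
        # Ignore the indel: identical bases within the shared (gap-free) columns.
        "identical_in_shared": (shared - substitutions, shared),
        # Count each mutation once, so a 3-base indel is a single event.
        "by_mutation_events": (total - substitutions - indel_events, total),
        # Count every differing base, so a 3-base indel is 3 differences.
        "by_differing_bases": (total - substitutions - gap_bases, total),
    }


for name, (same, out_of) in compare(chimp, human).items():
    print(f"{name}: {same}/{out_of} = {100 * same / out_of:.1f}%")
```

All three figures come from the same alignment; they differ only in what is counted and what it is divided by.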

Perhaps showing that these comparisons don’t have a single straightforward answer will give the public a better understanding of what they really look like. In the long run, it may not be the actual percentage that makes the most impact on the public. The fact that chimps share more DNA with humans than they do with other apes may cause some light bulbs to go off in a few heads.

T_aquaticus · April 24, 2018, 10:09pm

But how can you say that there is not orthology when parts of the human and chimp genome are not part of that comparison? That seems to be the error you made from the very start.

To use an analogy, we have two incomplete jigsaw puzzles. One puzzle has 90% of the pieces put together, and the other puzzle also has 90% of the puzzle pieces put together. The problem is that it is not the same 90%. There are sections where there are pieces in one puzzle but not in the other. It is easy to see that pieces are nearly the same in the same positions, and the picture on each piece differs by 1.5% where there are pieces in both puzzles at the same position.

Now, would it be correct to say that the sections where there are not pieces in both puzzles are 0% identical? Of course not, right? You need the actual pieces in place before you can compare them. More to the point, why would you ever think that when you start filling out the remaining 10% in each puzzle that the pieces won’t be 1.5% different just like the rest of the puzzle pieces?

TGLarkin · April 24, 2018, 10:54pm

Thank you all for the great discussion. No percentage of commonality would ever prove or disprove a common ancestor. Wouldn’t this be more dependent on the DNA sequence itself, such as the common location of pseudogenes (nonfunctional genes), ancient repetitive elements (AREs), etc.?

Joel_Duff · April 25, 2018, 1:24am

All very interesting discussion, but in the end the numbers (70%, 86%, 98%, 98.6%) don’t mean much without some sort of context. As T. aquaticus has said above, we need to know what we are comparing. But I would go a step further. Even if we agree on what we are comparing, what do the values we get at the end really tell us?
What I’ve been asking for (especially from YECs who tout the 70% or 85%) is an apples-to-apples pairwise comparison. If a chimp and human are 85% similar using a particular algorithm, why can’t that person use that same algorithm to tell us what the difference is between a Neanderthal and a human, or a Han Chinese and an African Bushman? I believe Joshua Swamidass has produced a similar comparison of mouse and rat, and they are far more different using the same criteria than a human and chimp. We need some numbers to compare against to make sense of 86%. That sounds like a substantial difference (and that is always the point of the anti-evolutionist), but what if you and I are only 94% similar by the same measure? Are we going to start saying that I am not human but you are?

gbrooks9 · April 25, 2018, 3:43am

I’m glad that some participants are noting that this “70%… 98%” topic is not exactly the most persuasive lever of opinion on the matter of evolution.

But what is persuasive is being able to show how specific broken genes are shared between chimps and humans … and virtually no other animal groups.

T_aquaticus · April 25, 2018, 3:53pm

First and foremost, I think it would be dependent on the phylogenetic signal in the genetic and morphological data.

One could argue that too large a gap between two genomes could not be explained by the accumulation of random mutations over a relatively short period of time (5-10 million years for chimp/human), but that certainly isn’t the case here. It then comes down to the relative differences between species and the consistency of the data with the tree predicted by the theory of evolution. Phylogenetic signal is still the most basic and powerful piece of evidence for the theory.

You also mention repetitive elements, and they fit right into this paradigm. One great example is the LTRs found in endogenous retroviruses.

“Third, sequence divergence between the LTRs at the ends of a given provirus provides an important and unique source of phylogenetic information. The LTRs are created during reverse transcription to regenerate cis-acting elements required for integration and transcription. Because of the mechanism of reverse transcription, the two LTRs must be identical at the time of integration, even if they differed in the precursor provirus (Fig. 1A). Over time, they will diverge in sequence because of substitutions, insertions, and deletions acquired during cellular DNA replication.”
http://www.pnas.org/content/96/18/10254

When you plug these repetitive elements into algorithms that construct trees you get the expected species tree with very few exceptions. You can check them out in this figure from the paper linked above.

DennisVenema · April 25, 2018, 4:13pm

The goal in attempting to reduce the % identity has always been to cast doubt on common ancestry.

What I have noticed with YEC and anti-common descent ID is that there is seldom (if ever?) an attempt to make sense of the entire sweep of the data supporting common ancestry. We see attempts to diminish the % identity. We see attempts to show that a few rare pseudogenes have been exapted and now have a new function. We see aspersions cast on incomplete lineage sorting, and so on - but we don’t see a coherent case that explains the data better from an antievolutionary perspective. We don’t even see articles that tackle these diverse lines of evidence in the same article, lest the readership notice the problems, I guess.

The one counterexample I know is Todd Wood’s 2006 article, and he concludes that YECs don’t have a good explanation. That’s it.

Jonathan_Burke · April 25, 2018, 6:20pm

Exactly. When we’re dealing with people who say things like this, the aim is clear.

“We are not attacking the teaching of Darwinian theory, we are just saying that criticisms of Darwin’s theory should also be taught” (Teach The Controversy strategy)
“But the fact is that we are still unable even to guess Darwinian pathways for the origin of most complex biological structures” (argument from ignorance)
“More recently, proponents of ID predicted that some “junk” DNA must have a function well before this view became mainstream among Darwinists.” (note “proponents of ID” versus “Darwinists”)
“In fact, ID is a logical inference, based on data gathered from the natural world, and hence it is firmly in the realm of science.” (ID is science)
“I do not know of a good evolutionary pathway for the development of the bacterial flagellum. In his latest book, Professor Richard Dawkins identifies a single possible intermediate step. This hardly constitutes a pathway.” (argument for irreducible complexity)

In summary:

The arguments are presented in the style of an educational film, and are generally presented among needlessly lengthy scientific descriptions and impressive visuals, which help to make creationist arguments sound reasonable to anyone without scientific training in the relevant disciplines. Anyone familiar with creationists will recognize their standard tactics including appeals to emotion, argument from ignorance, misdirection and occasionally blatant falsehoods, – Science, Just Science, October 2006.

T_aquaticus · April 25, 2018, 8:43pm

YEC and ID have always been about creating the thinnest of scientific veneers so that people could feel justified in rejecting evolution. Explaining the evidence has never been their intent.

glipsnort · April 28, 2018, 5:23pm

Since we have already identified 1.5% of the human genome as being unique to our species, of course I agree.

It’s not an analysis I’m interested in.

How much additional sequencing has been done of the chimpanzee genome?

RichardBuggs · May 1, 2018, 8:28am

Hi all,

I have done some more calculations to try to establish an upper bound on the similarity between human and chimpanzee genomes, as well as a lower bound. This is based on the most recent alignments between human and chimpanzee genomes, available at the UCSC genomics website (hg38 versus PanTro6); see Index of /goldenpath/hg38/vsPanTro6

By my calculations, the human genome has between 84.4% and 93.4% one-to-one orthology with the chimpanzee genome. The uncertainty makes allowance for possible current incompleteness in our knowledge of both the human and chimpanzee genomes. The upper bound of 93.4% assumes that all the regions of the human and chimpanzee genomes that we have not yet assembled and/or aligned will prove to have one-to-one orthology between humans and chimps. The lower bound of 84.4% assumes that these regions will prove to be different between humans and chimpanzees. I have assumed throughout that further sequencing is unlikely to significantly alter currently known differences due to SNPs, indels, and copy number variation.
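
The logic of the two bounds can be written down in a few lines. This is a minimal Python sketch of the reasoning described above, not the calculation actually used for this post: the function name `similarity_bounds` and its parameters are hypothetical, and the inputs in the usage line are toy numbers on a 1000-base scale, chosen only so that the output matches the percentages quoted above. They are not the underlying base counts.

```python
def similarity_bounds(total_bases, matching_bases, unresolved_bases):
    """Return (lower, upper) percentage similarity.

    lower: every unresolved base (unassembled or unaligned) is counted as different.
    upper: every unresolved base is counted as matching.
    """
    if matching_bases + unresolved_bases > total_bases:
        raise ValueError("matching + unresolved cannot exceed the total")
    lower = 100 * matching_bases / total_bases
    upper = 100 * (matching_bases + unresolved_bases) / total_bases
    return lower, upper


# Toy inputs (assumption): 1000 bases, 844 matching, 90 unresolved.
lower, upper = similarity_bounds(1000, 844, 90)
print(f"{lower:.1f}% to {upper:.1f}%")
```

The width of the interval is simply the unresolved fraction, so the bounds can only tighten as more sequence is assembled and aligned.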

Here is how I did the calculation. I downloaded both the reciprocal best (“rbest”) alignment and the “net” alignment (which allows copy number variation) from UCSC for hg38 and PanTro6. Using custom Perl scripts, I measured the length of each alignment, and the number of SNPs and insertions in each one. I looked up the “Total assembly gap length” in the hg38 human genome assembly statistics online (only a negligible number of these were present as Ns in the alignments). These gap lengths are only known rather approximately, and many are estimated to the nearest 1000 or 10,000 bases.
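
For anyone wanting to do an independent tally, here is a minimal Python sketch of counting alignment length, mismatches and gaps in axt-format alignments. It is not the Perl script used for the figures in this thread. It assumes the standard axt layout (a header line, then the primary sequence, then the aligning sequence, with `-` marking gaps), and the function name `axt_stats` and the two toy blocks are hypothetical, made up purely for illustration.

```python
def axt_stats(lines):
    """Tally columns across axt alignment blocks (header + two sequence rows)."""
    rows = [ln.strip() for ln in lines]
    rows = [r for r in rows if r and not r.startswith("#")]  # drop blanks, comments
    if len(rows) % 3:
        raise ValueError("expected header/primary/aligning triples")
    stats = dict(blocks=0, columns=0, matches=0, mismatches=0,
                 gap_bases=0, gap_runs=0)
    for i in range(0, len(rows), 3):
        # Upper-case so that soft-masked (lower-case) bases compare equal.
        primary, aligning = rows[i + 1].upper(), rows[i + 2].upper()
        if len(primary) != len(aligning):
            raise ValueError(f"unequal sequence rows in block: {rows[i]}")
        stats["blocks"] += 1
        prev_side = None  # which sequence carried the gap in the previous column
        for p, a in zip(primary, aligning):
            stats["columns"] += 1
            side = "primary" if p == "-" else "aligning" if a == "-" else None
            if side:
                stats["gap_bases"] += 1
                if side != prev_side:  # a new run of gap columns starts here
                    stats["gap_runs"] += 1
            elif p == a:  # note: N is not treated specially in this sketch
                stats["matches"] += 1
            else:
                stats["mismatches"] += 1
            prev_side = side
    return stats


# Two made-up blocks, not real genome data.
toy_axt = """\
0 chr1 101 110 chr1 201 208 + 500
ACGTACGTAC
ACGTTC--AC

1 chr1 150 155 chr1 260 266 + 300
AC-GTAC
ACTGTAC
"""

stats = axt_stats(toy_axt.splitlines())
print(stats)
```

For a real file, the same function could be fed an open file handle, since it only needs an iterable of lines. A whole-genome alignment would be better processed block by block than read into a list as this sketch does.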

Here are the stats I calculated for each alignment:

I used the differences between the two alignments to work out how much of the net.axt alignment was due to copy number variation, and how many SNPs and insertions there seem to be within copy number variants.

This yielded the following statistics for the overall similarity between the human and chimpanzee genome:

In my view, the upper bound of 93.4% is unlikely to prove to be the true value, once we have complete assemblies for both the human and chimpanzee genomes. This is because the regions of genomes that are hardest to assemble tend to be areas that are very repetitive, or fast evolving, or both. I therefore think it is unlikely that the 4.98% of the human genome that is represented by gaps in the hg38 assembly will prove to be orthologous to the chimpanzee genome. In addition, not all regions of the human genome that currently have no alignment look as if they are highly repetitive (though some are). I think that if these non-repetitive regions were present in the chimpanzee genome they would have been successfully sequenced and assembled by now.

I welcome feedback on these calculations and the methods and assumptions behind them, and especially identification of any errors I may have made.

Best wishes,

Richard

gbrooks9 · May 1, 2018, 2:13pm

@RichardBuggs

Dr. Buggs, what do you hope to demonstrate by having a solid percentage calculated?

Isn’t the Devil in the Detail of how many chromosomal features are shared between humans and chimps… but not as well as other branches of the Great Ape section of the Primate tree?

T_aquaticus · May 1, 2018, 3:30pm

I don’t see how you can make that claim given the fact that the human and chimp assembled genomes are not complete. There are gaps in both genomes where they don’t have good sequence or don’t have enough data to accurately place good sequence within each genome.

Shouldn’t that be closer to 96.5%? That is the figure in the chimp genome paper for orthologous regions. Do you have a reason why this figure would change so drastically?

glipsnort · May 1, 2018, 3:33pm

Are you assuming copy number variants are called correctly in both genomes?

RichardBuggs · May 2, 2018, 9:00pm

Hi Steve,

I have quite a backlog of questions to deal with, and only half an hour free, so I will try to tick off one of the major ones.

In response to my comment:

You said

As far as I can make out from NCBI, PanTro 3 and 4 were based on 6x Sanger genome coverage. PanTro5 had an additional 55x coverage of Illumina overlapping paired 250bp length reads, 2 Lanes of a Chicago library (Hi-C from Dovetail Genomics) and 9x coverage of PacBio long single molecule reads. The total sequence length of PanTro5 is 3,231,154,112bp (ungapped length =3,132,603,083bp), whereas PanTro4 is 3,309,561,368 (2,902,353,696bp ungapped).

I think it is worth noting that although PanTro5 has considerably more data than PanTro4, and is 8% longer in its ungapped length, it has only yielded an increase in one-to-one orthology with the human genome of 1.9 percentage points (from 82.3% to 84.2%).
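
The arithmetic behind these two comparisons can be checked directly from the assembly lengths quoted above. This is a small Python sketch; the variable names are my own.

```python
# Ungapped assembly lengths (bp) as quoted in this post.
pantro4_ungapped = 2_902_353_696
pantro5_ungapped = 3_132_603_083

# Relative growth in ungapped length from PanTro4 to PanTro5.
growth_pct = 100 * (pantro5_ungapped - pantro4_ungapped) / pantro4_ungapped

# Change in one-to-one orthology with hg38, in percentage points.
orthology_gain = 84.2 - 82.3

print(f"ungapped length growth: {growth_pct:.1f}%")
print(f"orthology gain: {orthology_gain:.1f} percentage points")
```

The first figure is a relative change in assembly length and the second is an absolute change in a percentage, so the two are not directly comparable without that caveat.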

I can’t find anything online about PanTro6. If you have access to information about how this is an improvement on PanTro5, I would be very grateful.

RichardBuggs · May 3, 2018, 8:42pm

Hi all,

To explore further the effect of improvement of the chimpanzee genome assembly on percentage similarity estimates for entire human and chimpanzee genomes, I have done the same calculation for the PanTro4 genome assembly (based on 6X Sanger read coverage) as I did earlier for the PanTro6 genome assembly (based on a lot more sequence data and covering more of the chimp genome - see my previous post).

Here are the two sets of stats side by side:

The improved chimpanzee genome has led to slightly greater precision in my estimates of human chimpanzee percentage similarity: the minimum is raised, and the maximum is slightly reduced.

One thing that I was not necessarily expecting is that the size of copy number variant (CNV) regions seems to have increased with the improved chimpanzee genome assembly (as shown by the “Paralogs in axtnet alignment” row). I had wondered if improved chimpanzee coverage might decrease this figure, as repeats in the chimpanzee became better resolved, but this does not seem to have occurred.

As ever, I welcome critiques and suggestions.