Is this what is referred to as an academic question?
Well, I’m still waiting for you to propose a hypothesis for a sudden bottleneck to two hominins at ~700 KYA followed by explosive growth that is not miraculous, but as the song goes, you can’t always get what you want.
But I think you’re being a bit pedantic here. Your critique only really works if you press the word “entire” to mean “absolutely every nucleotide”. Maybe “genome-wide” would have been a better choice, but “entire” works just fine as well. And yes - 95% is the best estimate we have for the genome-wide identity of chimps and humans if you count indels on a per-nucleotide basis. That’s the published value based on the largest sample we have, and you’ve still given us no reason to suppose that the bits left over that haven’t been chased down to the nth degree are going to be significantly different than that value. Even if they are, it will most likely be due to indels of highly repetitive DNA, which is really hard to account for in any case.
But back to “entire”. If you google “entire” for dictionary definitions, you’ll see things like this:
The whole of, without missing any part. E.g. “I’ve traveled the entire world.” Not so! says the pedant. You may have travelled to every country, but have you visited every city? Every town? Every street in every town? Every house on every street? Every room in every house? Every square foot of every room?
Clearly this gets a bit silly.
Or take this recent headline:
“Pompeo: Iran will face ‘wrath of entire world’ if it pursues nuclear weapons.”
Not so! says the pedant. North Korea will probably approve, and for that matter lots of Iranians will as well. There’s a guy in Albania who’s in favour too. Clearly, claiming the entire world will be wrathful is off base.
You also omitted my summary statement from your quote from Adam and the Genome:
“No matter how you slice it, the human and chimpanzee genomes are nearly identical to each other.”
That is what you have to deal with, even if you decide you want to slice it a little differently. That and all the other evidence for common ancestry. If you want to oppose common ancestry of humans and chimpanzees, it’s going to take a lot more than quibbling over the precise percent identity between the genomes.
Not really, in the sense that it’s not a question an academic scientist is likely to ask. “What is the rate of single-base substitutions between the two genomes?” is an academic question, as is “What fraction of the human genome represents unique sequence not present in the chimpanzee genome?” In contrast, “What is the overall identity between the two genomes?” is ambiguous and doesn’t correspond to any obvious scientific question.
Suppose there has been a recent duplication of a million base pair segment in humans, with both copies now ~2.5% different from a single chimpanzee region, and 0.2% different from each other. Is that a million bp of unique sequence in the human genome? What scientific question are you trying to answer with this comparison?
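To make the ambiguity concrete, here is a rough sketch of that duplication scenario scored two ways. Both counting conventions are hypothetical, chosen only to show how far apart the answers can land; neither is anyone's published method, and the variable names are made up for illustration.

```python
# Scenario from the post: a 1 Mbp segment duplicated in humans,
# each human copy ~2.5% diverged from the single chimpanzee region.
segment_bp = 1_000_000
divergence_to_chimp = 0.025
human_bp = 2 * segment_bp  # two human copies of the segment

# Convention A (hypothetical): report identity only within aligned sequence.
# Each human copy aligns to the chimp region at 97.5% identity.
identity_aligned_only = 1 - divergence_to_chimp

# Convention B (hypothetical): only one human copy may sit in a one-to-one
# alignment; the second copy is scored as entirely non-matching.
matching_bp = segment_bp * (1 - divergence_to_chimp)
identity_one_to_one = matching_bp / human_bp

print(f"aligned-only identity: {identity_aligned_only:.2%}")
print(f"one-to-one identity:   {identity_one_to_one:.2%}")
```

Same sequence, same biology, and the "identity" is either 97.5% or under 50% depending on a bookkeeping choice.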
As I mentioned up there ^^ long ago, the singular focus of trying to maximize the differences between humans and chimpanzees and minimize the “% identity value” has only ever been a tactic to attempt to reduce confidence in common ancestry.
It’s for this reason that I find the lack of engagement on the other myriad genomic evidences for common ancestry disingenuous on this point. If it’s really common ancestry we’re discussing, then let’s discuss the whole breadth of the evidence.
It’s been a lot harder to be an antievolutionist since we started sequencing genomes.
Okay. . .
I hesitate to conclude that it is now 5%. The MUMmer4 paper concluded that 4% of the chimpanzee genome did not align against the human, which looks quite consistent with the sum of 2.7% from segmental duplications and 1.5% from smaller indels (from the original chimp paper). Aligning the human against chimpanzee of course gives a smaller aligned fraction, but that’s inevitable when aligning the more complete against the less complete genome. (Now, the reciprocal LASTZ alignments may give a different value, but that kind of dependence on method would make any conclusions quite tenuous.)
Given their stated doubt that they could assemble structural arrangements correctly, I think they are eliminating a class of alignment that includes incorrect assembly as well as portions of the genome that they don’t think they can assemble correctly. That is not exactly what I said previously.
95% seems like a defensible value based on the MUMmer4 paper, which claims 98% identity across 96% of the genome. As I argued above, I think the 98% is probably an underestimate and should be 98.8%, which gives 95% as an overall value.
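The arithmetic behind that figure is easy to check. This is only a sketch: the variable names are mine, it assumes unaligned sequence is counted as entirely non-identical, and 98.8% is the adjusted value argued for in this thread rather than the number the MUMmer4 paper reports.

```python
# Overall identity = (fraction of genome aligned) x (identity within alignments).
# Assumption: everything outside the alignments counts as non-identical.
aligned_fraction = 0.96     # as quoted from the MUMmer4 paper
identity_reported = 0.98    # identity within alignments, as quoted
identity_adjusted = 0.988   # adjusted value argued for in this thread

def overall_identity(aligned, identity_within):
    return aligned * identity_within

overall_reported = overall_identity(aligned_fraction, identity_reported)
overall_adjusted = overall_identity(aligned_fraction, identity_adjusted)

print(f"reported: {overall_reported:.1%}")
print(f"adjusted: {overall_adjusted:.1%}")
```

With the reported 98% the product rounds to 94%; with 98.8% it rounds to 95%.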
One of the things in Adam and the Genome that I try to underscore is that the precise numbers are probably going to shift around as new technologies are brought to bear on these questions. It’s for this reason that I say things like “about 95%” and so on. I did the same for the ancestral effective population numbers (Ne). But changes in the precise numbers are not likely to invalidate the general consensus - we evolved, and we did so as a population.
No one is more interested in the “% genome identity” thing than folks trying to cast doubt on common ancestry. It’s just not a precise value that scientists are interested in, because it doesn’t answer interesting scientific questions in the way other values do (as you’ve pointed out).
That last sentence is everything anyone needs to know about this conversation.
Here is a final summary of my position.
“How similar are the human and chimpanzee genomes?” is a relatively straightforward scientific question. We are hindered by the still somewhat incomplete nature of both the human and the chimpanzee reference genome assemblies, but we can make this clear in our assessments and allow for the uncertainties that it raises.
The best way to assess the similarity of two genomes is to take complete genome assemblies of both species, that have been assembled independently, and align them together. The alignment process involves searching the contents of the two genomes against each other. Parts of both genomes that are too different to match one another will be absent from the alignment, unless they are very short, in which case they will be included as “indels” (longer indels, even if they have well characterised flanking sequences, will be absent from the alignment). Within parts that do align, there will be some mismatches between the two genomes, where one or a few nucleotides differ, which in this discussion we have been calling “SNPs”. In addition there will be some parts of each genome that are present twice or multiple times in one genome and are present fewer times in the other genome. We have referred to these as “paralogs” or “copy number variants” (CNVs). To come up with an accurate figure of the similarity of the entirety of two genomes, we need to take into account all these types of difference.
For some purposes, when talking about the similarity between two genomes we may want to just focus on one type of difference, such as SNPs. If we do this, we should always specify which types of difference we have and have not taken into account. The most well-known estimates for the similarity of the human and chimpanzee genomes only take into account SNPs and small indels. Copy number variants are less often included, and regions of the two genomes that do not align are commonly ignored.
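As a toy illustration of why the types of difference need to be specified, the sketch below scores one short gapped alignment two ways: substitutions only, and substitutions plus indels counted per nucleotide. The sequences and the function name are invented for the example, and it ignores CNVs and unaligned regions entirely.

```python
def identity_stats(seq_a, seq_b):
    """Tally the columns of a gapped pairwise alignment ('-' marks a gap)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    match = mismatch = gap = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            gap += 1          # indel column
        elif a == b:
            match += 1        # identical nucleotide
        else:
            mismatch += 1     # single-nucleotide difference
    return {
        "match": match,
        "mismatch": mismatch,
        "gap": gap,
        # Identity counting only substitutions (gap columns ignored).
        "substitution_only": match / (match + mismatch),
        # Identity counting every gap position as a difference.
        "with_indels": match / (match + mismatch + gap),
    }

# Made-up 20-column alignment: one mismatch and two 2 bp indels.
toy_a = "ACGTACGT--ACGTACGTAC"
toy_b = "ACGTACCTTTACGTAC--AC"
stats = identity_stats(toy_a, toy_b)
print(stats)
```

The same alignment comes out at 93.75% or 75% identical depending on whether the gap columns are counted.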
When assessing the total similarity of the human genome to the chimp genome, we also need to bear in mind that roughly 5% of the human genome has not been fully assembled yet, so the best we can do for that 5% is predict how similar it will be to the chimpanzee genome. We do not yet know for sure. The chimpanzee genome assembly is less well assembled, so in future we may assemble parts of the chimpanzee genome that are similar to the human genome - this is another source of uncertainty to keep in mind.
To come up with the most accurate current assessment that I could of the similarity of the human and chimpanzee genomes, I downloaded from the UCSC genomics website the latest alignments (made using the LASTZ software) between the human and chimpanzee genome assemblies, Hg38 and PanTro6. See post #35 above for details. This gave the following for the human genome:
4.06% had no alignment to the chimp assembly
5.18% was in CNVs relative to chimp
1.12% differed due to SNPs in the one-to-one best aligned regions
0.28% differed due to indels within the one-to-one best aligned regions
The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 84.38%.
In order to assess how improvements in genome assemblies can change these figures, I did the same analyses on the alignment of the older PanTro4 assembly against Hg38 (see post #40 above). The PanTro4 assembly was based on a much smaller amount of sequencing than the PanTro6 assembly (see post #39 above). In this PanTro4 alignment:
6.29% had no alignment to the chimp assembly
5.01% was in CNVs relative to chimp
1.11% differed due to SNPs in the one-to-one best aligned regions
0.28% differed due to indels within the one-to-one best aligned regions
The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 82.34%.
Thus the large improvement in the chimpanzee genome assembly between PanTro4 and PanTro6 has led to an increase in CNVs detected, and a decrease in the non-aligning regions. It has only increased the one-to-one exact matches from 82.34% to 84.38% even though the chimpanzee genome assembly is at least 8% more complete (I think) in PanTro6.
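For anyone who wants to check the bookkeeping, the two sets of figures can be tallied side by side. This is a sketch using only the percentages listed above; the dictionary keys are my own labels. In both cases the listed categories sum to about 95%, and reading the remaining ~5% as the not-yet-assembled part of the human genome is my interpretation, not something the alignment files state.

```python
# Percentages of the human genome, per chimp assembly, as listed above.
tallies = {
    "PanTro6": {"unaligned": 4.06, "cnv": 5.18, "snp": 1.12,
                "indel": 0.28, "exact": 84.38},
    "PanTro4": {"unaligned": 6.29, "cnv": 5.01, "snp": 1.11,
                "indel": 0.28, "exact": 82.34},
}

def remainder(tally):
    """Percentage of the genome not covered by any listed category."""
    return round(100.0 - sum(tally.values()), 2)

remainders = {name: remainder(t) for name, t in tallies.items()}

# Change in each category going from PanTro4 to PanTro6.
deltas = {
    key: round(tallies["PanTro6"][key] - tallies["PanTro4"][key], 2)
    for key in tallies["PanTro6"]
}

print(remainders)
print(deltas)
```

The deltas show the pattern described: non-aligning sequence drops by a little over 2 points, CNVs rise slightly, and exact matches rise by about 2 points.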
The PanTro4 assembly has also been aligned to the human genome using the software MUMmer4 (reported in: Marçais, Guillaume, et al. “MUMmer4: A fast and versatile genome alignment system.” PLoS computational biology 14.1 (2018): e1005944). This method gives broadly similar figures to my analyses of the UCSC LASTZ alignments. MUMmer places 2.782 Gb of the sequence in mutual best alignments, and the total length of the LASTZ alignment is 2.761 Gb. In the MUMmer analysis approximately 306 Mb (9.91%) of the human sequence did not align to the chimpanzee sequence in mutual best alignments. This fits well with the LASTZ result of 6.29% non-aligning plus 5.01% CNV = 11.30% not aligning. Overall, the MUMmer software has been slightly more generous in aligning the human and chimp genomes, but as @glipsnort has pointed out, MUMmer is giving a higher estimate of SNP differences within its alignments. This is probably a signal that it has over-aligned the two genomes and some of its alignments are spurious. Thus I think we are best off trusting the LASTZ alignment over the MUMmer alignment, though the difference between the results of the two methods is rather small.
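The MUMmer and LASTZ figures quoted above can be cross-checked against each other with a little arithmetic. This is a sketch only: it assumes the MUMmer aligned and unaligned amounts together make up the human sequence that was considered, which is my inference from the numbers and not something stated in the paper as quoted here.

```python
# Figures as quoted above (Gb = gigabases).
mummer_aligned_gb = 2.782
mummer_unaligned_gb = 0.306     # "approximately 306 Mb"
mummer_unaligned_pct = 9.91     # percentage quoted alongside it
lastz_aligned_gb = 2.761
lastz_unaligned_pct = 6.29
lastz_cnv_pct = 5.01

# Assumption: aligned + unaligned = total human sequence considered.
implied_total_gb = mummer_aligned_gb + mummer_unaligned_gb
recomputed_unaligned_pct = 100 * mummer_unaligned_gb / implied_total_gb

# LASTZ: everything outside one-to-one alignments.
lastz_not_one_to_one_pct = lastz_unaligned_pct + lastz_cnv_pct

# How much more sequence MUMmer aligned than LASTZ, in megabases.
extra_aligned_mb = (mummer_aligned_gb - lastz_aligned_gb) * 1000

print(f"implied total: {implied_total_gb:.3f} Gb")
print(f"MUMmer unaligned: {recomputed_unaligned_pct:.2f}%")
print(f"LASTZ not one-to-one: {lastz_not_one_to_one_pct:.2f}%")
print(f"extra aligned by MUMmer: {extra_aligned_mb:.0f} Mb")
```

Under that assumption the quoted 9.91% is reproduced, and the gap between the two methods is on the order of 20 Mb of aligned sequence.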
As 5% of the human genome is still unassembled, and 5% seems to be CNVs relative to chimp, and 4% is unaligned to the chimp genome, I cannot agree with @DennisVenema and @glipsnort that “95% is the best estimate we have for the genome-wide identity of chimps and humans”. I would accept 95% as a prediction, but not as a statement of established fact.
I predict that the 95% figure will prove to be wrong, because (on the basis of my comparison of the PanTro4 and PanTro6 alignments to Hg38) I think that the CNV differences are here to stay, and I doubt that the currently unaligned or unsequenced regions of the human genome will all prove to be 95% the same as the chimpanzee genome. Some of the “unaligned” human sequences are medium-sized indels, and it is hard to see why they would not have been assembled in the chimp if they were present. I also expect at least some of these unaligned or unsequenced sequences to be rapidly evolving.
In 2008 I wrote “I predict that when we have a reliable, complete chimpanzee genome, the overall similarity of the human genome will prove to be close to 70% (and very far from 99%).” This prediction is not borne out by the more recent data above. I made a mistake in my 2008 calculations in the way in which I dealt with CNVs, which put me out by 2.7%, but this was only a minor component of why my estimate was so low. The main reason why my estimate was so low was because I thought that the 2005 chimpanzee genome assembly was far more complete than it actually was. This was because the authors claimed in the main text of the chimpanzee paper "the draft genome assembly…covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases.” Thanks to discussion in this thread with @glipsnort (see post #62, #63 and others above), who was one of the authors of the 2005 chimp genome paper, I can now see that the 2005 draft genome assembly was not as good as this claim suggested. However, in 2008 I did not know this, and my prediction was made in good faith on the basis of my understanding of the 2005 paper.
Thank you all for an interesting discussion. Please accept this as my final summing up and closing statement.
Thank you for your time and efforts. Please feel free to come back and contribute your perspective on other topics, as well. I would be interested to hear your thoughts on something other than your specialty. Of course, I’m assuming that you do think about things besides genetics once in a while …
I have a feeling that if the 95% does prove out Dr. Buggs would still argue that it doesn’t indicate common ancestry.
Neither the chimp nor the human genome is completely assembled, so that would be the first major problem. There are gaps in each assembly, and those gaps will be in different places in each one. This means that a lack of a match between the genome assemblies could simply reflect a gap in one of the assemblies.
It would be interesting to see the results for the same comparison of the chimp and gorilla genomes. It is the pattern of similarity that evidences common ancestry and evolution, not a set percentage. If there are more differences between the chimp and gorilla genome, then what?
You probably know that on the topic of Human/Chimp genome comparison, I’m not exactly an “easy sell”.
But your “Final Summary” is impressive in its conciseness, its frankness and even in its elegant simplification of highly complex data.
I celebrate your most important sentence!:
“The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 82.34%.”
Intuitively speaking, this seems to be the best way most of us can tune into this issue:
by comparing “one-to-one exact matches in the genome”. This might not be the best approach for all purposes, but it is certainly an excellent benchmark to start with. If anyone reads anything in this thread, it should be your Final Summary!
Best wishes, George Brooks
Good to know this.
Also good to know when comparing program P1 ‘build a chimp’ and program P2 ‘build a human’, is whether P1 and P2 only differ in their parameters or also in their dimensions. If the difference is only in the parameters, P1 can change into P2 by random variation of its parameters and selection. If the difference is also in the dimensions, billions of variations of the parameters during billions of years cannot produce a change in dimensions (= second order change/ transformation / innovation). The progress of science in the ENCODE-project (see: https://www.encodeproject.org/ ) will reveal eventually the type of differences between P1 and P2. Notice that change of P1 and P2 in their dimensions is antagonized by mutation repair systems. (see: Can mutations produce mutation repair systems? )
Dr. William DeJong
This might be a reasonable question if genomes were programs or had parameters and dimensions.
But WilliamDJ says it does so that must mean it is true.
Comparing DNA to a program is not a bad analogy but at a fundamental level it breaks down.
It’s an okay analogy for some purposes, terrible for others. (And programs don’t have dimensions either.)
Are you hijacking the thread with the mutation repair red herring? If there are still mutations despite mutation repair systems, and if mutation repair systems are genetically established, I really can’t see the point of bringing that old saw back into a topic that won’t benefit a bit from its introduction!