Dear all,
Here is a final summary of my position.
“How similar are the human and chimpanzee genomes?” is a relatively straightforward scientific question. We are hindered by the still somewhat incomplete nature of both the human and the chimpanzee reference genome assemblies, but we can make this clear in our assessments and allow for the uncertainties that it raises.
The best way to assess the similarity of two genomes is to take complete, independently assembled genome assemblies of both species and align them. The alignment process involves searching the contents of the two genomes against each other. Parts of both genomes that are too different to match one another will be absent from the alignment, unless they are very short, in which case they will be included as “indels” (longer indels, even if they have well characterised flanking sequences, will be absent from the alignment). Within parts that do align, there will be some mismatches between the two genomes, where one or a few nucleotides differ, which in this discussion we have been calling “SNPs”. In addition, some sequences will be present in two or more copies in one genome but in fewer copies in the other. We have referred to these as “paralogs” or “copy number variants” (CNVs). To come up with an accurate figure for the similarity of the entirety of two genomes, we need to take into account all these types of difference.
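As a toy illustration of how the columns of a pairwise alignment fall into matches, “SNPs” and indels, here is a minimal Python sketch. The function name `classify_columns` and the two short sequences are invented for illustration; this is not the LASTZ pipeline, and it says nothing about unaligned regions or CNVs, which can only be seen genome-wide.

```python
from collections import Counter


def classify_columns(human: str, chimp: str) -> Counter:
    """Count match, SNP and indel columns in a gapped pairwise alignment.

    Both strings are rows of the same alignment, with '-' marking a gap.
    (Hypothetical helper for illustration only.)
    """
    if len(human) != len(chimp):
        raise ValueError("aligned rows must have equal length")
    counts = Counter()
    for h, c in zip(human.upper(), chimp.upper()):
        if h == "-" and c == "-":
            continue  # a column that is empty in both rows carries no information
        if h == "-" or c == "-":
            counts["indel"] += 1  # base present in one genome only
        elif h == c:
            counts["match"] += 1  # identical nucleotide in both genomes
        else:
            counts["snp"] += 1  # aligned but different nucleotide
    return counts


# Invented toy alignment: one mismatch and two single-base gaps.
toy = classify_columns("ACGT-ACGTTA", "ACCTGAC-TTA")
print(dict(toy))
```

Sequence that never makes it into an alignment at all does not appear in any of these three counts, which is why it has to be tallied separately.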
For some purposes, when talking about the similarity between two genomes we may want to focus on just one type of difference, such as SNPs. If we do this, we should always specify which types of difference we have and have not taken into account. The best-known estimates of the similarity of the human and chimpanzee genomes take into account only SNPs and small indels. Copy number variants are less often included, and regions of the two genomes that do not align are commonly ignored.
When assessing the total similarity of the human genome to the chimp genome, we also need to bear in mind that roughly 5% of the human genome has not been fully assembled yet, so the best we can do for that 5% is predict how similar it will be to the chimpanzee genome. We do not yet know for sure. The chimpanzee genome assembly is less complete than the human one, so in future we may assemble parts of the chimpanzee genome that are similar to the human genome. This is another source of uncertainty to keep in mind.
To come up with the most accurate current assessment that I could of the similarity of the human and chimpanzee genomes, I downloaded from the UCSC genomics website the latest alignments (made using the LASTZ software) between the human and chimpanzee genome assemblies, hg38 and PanTro6. See post #35 above for details. This gave the following for the human genome:
4.06% had no alignment to the chimp assembly
5.18% was in CNVs relative to chimp
1.12% differed due to SNPs in the one-to-one best aligned regions
0.28% differed due to indels within the one-to-one best aligned regions
The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 84.38%.
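The percentages above can be checked with a few lines of arithmetic. The sketch below uses only the figures listed in this post; the variable names are mine. Note that the four difference categories plus the exact matches do not sum to 100%. Reading the remainder as the roughly 5% of the human genome that is still unassembled is an assumption based on the earlier paragraph, not something the list states.

```python
# Figures for hg38 vs PanTro6, as percentages of the human genome (from the list above).
differences = {
    "no_alignment": 4.06,
    "cnv": 5.18,
    "snp": 1.12,
    "indel": 0.28,
}
exact_match = 84.38

# Sum of the four listed difference categories.
listed_differences = round(sum(differences.values()), 2)

# Whatever is left once differences and exact matches are accounted for.
# ASSUMPTION: this remainder corresponds to the ~5% unassembled human sequence.
residual = round(100 - listed_differences - exact_match, 2)

print(f"listed differences: {listed_differences}%")
print(f"exact matches:      {exact_match}%")
print(f"unaccounted for:    {residual}%")
```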
In order to assess how improvements in genome assemblies can change these figures, I did the same analyses on the alignment of the older PanTro4 assembly against hg38 (see post #40 above). The PanTro4 assembly was based on a much smaller amount of sequencing than the PanTro6 assembly (see post #39 above). In this PanTro4 alignment:
6.29% had no alignment to the chimp assembly
5.01% was in CNVs relative to chimp
1.11% differed due to SNPs in the one-to-one best aligned regions
0.28% differed due to indels within the one-to-one best aligned regions
The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 82.34%.
Thus the large improvement in the chimpanzee genome assembly between PanTro4 and PanTro6 has led to an increase in CNVs detected, and a decrease in the non-aligning regions. It has only increased the one-to-one exact matches from 82.34% to 84.38% even though the chimpanzee genome assembly is at least 8% more complete (I think) in PanTro6.
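To make the comparison between the two assemblies explicit, here is a small sketch that subtracts the PanTro4 figures from the PanTro6 figures category by category. The numbers are the ones listed above; the dictionary and key names are my own labels.

```python
CATEGORIES = ("no_alignment", "cnv", "snp", "indel", "exact_match")

# Percentages of the human genome (hg38) in each category, from the two lists above.
vs_pantro4 = dict(zip(CATEGORIES, (6.29, 5.01, 1.11, 0.28, 82.34)))
vs_pantro6 = dict(zip(CATEGORIES, (4.06, 5.18, 1.12, 0.28, 84.38)))

# Change in percentage points when moving from the PanTro4 to the PanTro6 alignment.
change = {name: round(vs_pantro6[name] - vs_pantro4[name], 2) for name in CATEGORIES}

for name in CATEGORIES:
    print(f"{name:>12}: {vs_pantro4[name]:6.2f} -> {vs_pantro6[name]:6.2f} ({change[name]:+.2f})")
```

The drop in non-aligning sequence (about 2.2 percentage points) is almost entirely taken up by the rise in exact matches and CNVs, while the SNP and indel figures barely move.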
The PanTro4 assembly has also been aligned to the human genome using the software MUMmer4 (reported in: Marçais, Guillaume, et al. “MUMmer4: A fast and versatile genome alignment system.” PLoS Computational Biology 14.1 (2018): e1005944). This method gives broadly similar figures to my analyses of the UCSC LASTZ alignments. MUMmer places 2.782 Gb of the sequence in mutual best alignments, and the total length of the LASTZ alignment is 2.761 Gb. In the MUMmer analysis approximately 306 Mb (9.91%) of the human sequence did not align to the chimpanzee sequence in mutual best alignments. This fits well with the LASTZ result of 6.29% non-aligning plus 5.01% CNV = 11.30% not aligning one-to-one. Overall, the MUMmer software has been slightly more generous in aligning the human and chimp genomes, but as @glipsnort has pointed out, MUMmer is giving a higher estimate of SNP differences within its alignments. This is probably a signal that it has over-aligned the two genomes and some of its alignments are spurious. Thus I think we are best off trusting the LASTZ alignment over the MUMmer alignment, though the difference between the results of the two methods is rather small.
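The MUMmer and LASTZ figures can be put side by side with some simple arithmetic. This is only a consistency check on the numbers quoted in this post, with variable names of my own choosing; it does not re-run either aligner.

```python
# MUMmer4 figures for hg38 vs PanTro4, as quoted above.
mummer_aligned_mb = 2782.0    # 2.782 Gb in mutual best alignments
mummer_unaligned_mb = 306.0   # human sequence outside mutual best alignments
mummer_unaligned_pct = 9.91

# LASTZ figures for hg38 vs PanTro4, as quoted above.
lastz_aligned_mb = 2761.0     # 2.761 Gb total alignment length
lastz_not_one_to_one_pct = 6.29 + 5.01  # non-aligning plus CNV

# Human sequence length implied by "306 Mb is 9.91%".
implied_total_mb = mummer_unaligned_mb / (mummer_unaligned_pct / 100)

# The same total reached by adding aligned and unaligned sequence.
summed_total_mb = mummer_aligned_mb + mummer_unaligned_mb

# How much more MUMmer places in alignments than LASTZ does.
extra_aligned_mb = mummer_aligned_mb - lastz_aligned_mb
gap_pct_points = lastz_not_one_to_one_pct - mummer_unaligned_pct

print(f"implied total: {implied_total_mb:.0f} Mb, summed total: {summed_total_mb:.0f} Mb")
print(f"MUMmer aligns {extra_aligned_mb:.0f} Mb more than LASTZ")
print(f"LASTZ leaves {gap_pct_points:.2f} percentage points more outside one-to-one alignments")
```

Both routes give a human sequence length of about 3.09 Gb, so the MUMmer numbers are internally consistent, and the two methods differ by only about 21 Mb of aligned sequence.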
As 5% of the human genome is still unassembled, and 5% seems to be CNVs relative to chimp, and 4% is unaligned to the chimp genome, I cannot agree with @DennisVenema and @glipsnort that “95% is the best estimate we have for the genome-wide identity of chimps and humans”. I would accept 95% as a prediction, but not as a statement of established fact.
I predict that the 95% figure will prove to be wrong, because (on the basis of my comparison of the PanTro4 and PanTro6 alignments to hg38) I think that the CNV differences are here to stay, and I doubt that all of the currently unaligned or unsequenced regions of the human genome will prove to be 95% the same as the chimpanzee genome. Some of the “unaligned” human sequences are medium-sized indels, and it is hard to see why they would not have been assembled in the chimp if they were present. I also expect at least some of these unaligned or unsequenced sequences to be rapidly evolving.
In 2008 I wrote “I predict that when we have a reliable, complete chimpanzee genome, the overall similarity of the human genome will prove to be close to 70% (and very far from 99%).” This prediction is not borne out by the more recent data above. I made a mistake in my 2008 calculations in the way in which I dealt with CNVs, which put me out by 2.7%, but this was only a minor component of why my estimate was so low. The main reason my estimate was so low was that I thought the 2005 chimpanzee genome assembly was far more complete than it actually was. This was because the authors claimed in the main text of the chimpanzee paper “the draft genome assembly…covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases.” Thanks to discussion in this thread with @glipsnort (see posts #62, #63 and others above), who was one of the authors of the 2005 chimp genome paper, I can now see that the 2005 draft genome assembly was not as good as this claim suggested. However, in 2008 I did not know this, and my prediction was made in good faith on the basis of my understanding of the 2005 paper.
Thank you all for an interesting discussion. Please accept this as my final summing up and closing statement.
Best regards,
Richard