Human Chimp Genome Similarity

glipsnort · June 19, 2018, 1:03am

RichardBuggs:

The authors aligned PanTro4 with Hg38 using MUMmer. They write (emphasis added):

“MUMmer had 2.782 Gb of the sequence in mutual best alignments, where each location in the chimp was aligned to its best hit in human and vice versa, with an average identity of 98.07%. The 1.93% nucleotide-level divergence found here is higher than the 1.23% reported in the original chimpanzee genome paper [25]. Our higher divergence is likely due to two factors: first, the 2005 report was based on 2.4 Gb of aligned sequence from older versions of both genomes, while ours is based on 2.782 Gb (16% more sequence) aligned between the current, more-complete versions of both genomes. Second, the original report used different methods, and may have counted fewer small indels than were counted in our alignments. Approximately 306 Mb (9.91%) of the human sequence did not align to the chimpanzee sequence, while 138 Mb (4.15%) of the chimpanzee sequence did not align to human. We detected 390 Mb in alignments where multiple sequences from chimpanzee aligned to the same location in human sequence and thus only one was chosen as the best alignment based on alignment identity.”

The important thing to note here is how different the alignment success was in the two directions: 90% of the human genome aligned to chimpanzee, while 96% of the chimpanzee genome aligned to human. Now this could mean that humans have an extra 6% of their genome that is unique to our species, but it is far more likely that the lower success rate reflects the imperfect nature of the chimpanzee assembly that the human sequence is being aligned against. This is the obvious conclusion, and presumably the source of the 96% in the paper’s abstract.
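For anyone who wants to check the arithmetic, here is a small sketch that recomputes the rounded percentages from the figures in the quoted passage. The variable names are my own, the numbers are copied from the quote, and nothing here reruns the alignment itself.

```python
# Figures copied from the quoted passage; variable names are illustrative only.
human_unaligned_pct = 9.91   # human sequence that did not align to chimpanzee
chimp_unaligned_pct = 4.15   # chimpanzee sequence that did not align to human
mean_identity_pct = 98.07    # average identity within mutual best alignments
aligned_gb_2005 = 2.4        # aligned sequence in the 2005 report
aligned_gb_new = 2.782       # aligned sequence in the newer analysis

# Fraction of each genome that did align, going in each direction.
human_aligned_pct = 100 - human_unaligned_pct
chimp_aligned_pct = 100 - chimp_unaligned_pct

# The asymmetry between the two directions, in percentage points.
direction_gap = chimp_aligned_pct - human_aligned_pct

# Divergence within aligned sequence, and the growth in aligned sequence.
divergence_pct = 100 - mean_identity_pct
extra_aligned_pct = (aligned_gb_new / aligned_gb_2005 - 1) * 100

print(f"human aligned to chimp: {human_aligned_pct:.2f}%")
print(f"chimp aligned to human: {chimp_aligned_pct:.2f}%")
print(f"gap between directions: {direction_gap:.2f} points")
print(f"divergence in aligned sequence: {divergence_pct:.2f}%")
print(f"additional aligned sequence vs 2005: {extra_aligned_pct:.1f}%")
```

The gap between directions comes out a little under six percentage points, which is the "extra 6%" at issue. Note that the 98.07% identity applies only to the sequence that ended up in mutual best alignments, not to either genome as a whole.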

Now, of course it’s possible that if the equivalent 6% of the chimpanzee genome were better assembled, it would show little similarity with human DNA and that alignment in both directions would only be about 90% successful. But I see no reason to think that that would be the case.

It’s appropriate to have doubts. If you want to draw conclusions about the overall identity of human and chimpanzee DNA, you should have some kind of understanding of how good your procedure is at estimating that quantity, especially since the procedure in question wasn’t designed for this purpose and since nobody here knows much about it. Most of the DNA you’re interested in, in particular the segmental duplications, lies in the trickiest parts of the genome to align, and the tricky bits can be quite tricky. Recent segmental duplications are nearly identical in sequence, and yes, I’d like to know how often the alignment pipeline picks the wrong copy, thereby breaking the best reciprocal alignment.
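To make the worry about duplications concrete, here is a toy sketch of a reciprocal-best-hit rule applied to two nearly identical copies. Everything in it is hypothetical: the locus names (H1, H2, C1, C2), the identity values, and the functions are invented for illustration, and this is not a description of how MUMmer or the paper's pipeline actually selects alignments.

```python
def best_hits(scores):
    """Map each query locus to the target locus with the highest identity."""
    return {query: max(targets, key=targets.get) for query, targets in scores.items()}


def transpose(scores):
    """Flip a query -> target score table into target -> query."""
    flipped = {}
    for query, targets in scores.items():
        for target, identity in targets.items():
            flipped.setdefault(target, {})[query] = identity
    return flipped


def reciprocal_best(forward, backward):
    """Keep only pairs where each locus is the other's best hit."""
    back = best_hits(backward)
    return {(q, t) for q, t in best_hits(forward).items() if back.get(t) == q}


# Hypothetical percent identities for a recent duplication. Assume the true
# orthologous pairs are H1-C1 and H2-C2; the copies differ by tenths of a percent.
h2c = {
    "H1": {"C1": 98.8, "C2": 98.7},
    "H2": {"C1": 98.9, "C2": 98.6},
}
c2h = transpose(h2c)

print(reciprocal_best(h2c, c2h))
```

With these made-up numbers, the only pair that survives is H2 with C1, which is not one of the assumed true pairs, and H1 and C2 drop out of the mutual best set entirely. How often something like this happens in a real pipeline is exactly the thing I'd want measured.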