Human Chimp Genome Similarity

glipsnort · May 15, 2018, 10:18am

Hi @RichardBuggs. My main concern is something I already asked about, albeit somewhat tersely. As I recall, the main discrepancy between your current estimates of identity and the numbers in the chimp genome paper is the roughly 5% of the genome that looked like structural variation. I assume these are human regions whose best alignment is to a chimpanzee location that, in turn, aligns best to some other part of the human genome rather than to the starting point (that is, the best alignments are not reciprocal).
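To make that assumption concrete, here is a minimal sketch of the reciprocal-best-alignment check I have in mind. The region names and the two lookup tables are invented for illustration; this is not the procedure actually used, only what "best alignment does not map back to the starting point" would mean in practice.

```python
# Toy illustration only: region IDs and best-hit tables are hypothetical.
def non_reciprocal_regions(human_to_chimp, chimp_to_human):
    """Return human regions whose best chimp hit does not map back to them."""
    flagged = []
    for human_region, chimp_region in human_to_chimp.items():
        # The chimp location's own best human hit must be the starting region.
        if chimp_to_human.get(chimp_region) != human_region:
            flagged.append(human_region)
    return flagged

# Best chimp hit for each human region (hypothetical data).
human_to_chimp = {"hs_A": "pt_1", "hs_B": "pt_1", "hs_C": "pt_2"}
# Best human hit for each chimp region (hypothetical data).
chimp_to_human = {"pt_1": "hs_A", "pt_2": "hs_C"}

flagged = non_reciprocal_regions(human_to_chimp, chimp_to_human)
print(flagged)
```

In this toy case hs_B is flagged: it aligns best to pt_1, but pt_1 aligns best to hs_A. A paralogous or copy-number-variable region would behave the same way without implying any real sequence difference between the species.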

Something like 5% to 10% of the human genome is known to show copy number variation; these regions are notoriously difficult to call correctly in individual genomes and are a known source of incorrect variant calling (e.g. this paper). Build 38 of the human reference genome contains substantial regions with alternate assemblies to describe some of this variation (I think amounting to 2% of the genome, but I could be mistaken). How did the alignment procedure you are using handle these regions? More broadly, to set an upper limit on the identity of the two genomes, you have to either allow for errors in assigning the correct location of reads from variable regions, or show that the error is negligible. Given the state of the genome assemblies and the alignment procedure used, what is your estimate of this error rate, and how did you make that estimate?

My numbers come from the same supplemental note. The 87% comes from this: “The 37,931 chimpanzee scaffolds comprise 2.73 Gb of sequence and span 3.109 Gb of the genome. Of these, a total of 33,180 (2.70 Gb of sequence spanning 3.077 Gb) scaffolds had significant alignments to the human genome.” That is the summary statement of the amount of chimpanzee sequence that aligned to the human reference. The 10% of the assembly that they thought was wrong was the fraction removed in the steps described in the following paragraphs, which lead to this summary statement: “The total anchored sequence after these steps dropped to 2.74 Gb (2.41 Gb of actual contig length), or 88% of the total chimpanzee sequence.” Some of the roughly 0.3 Gb that was removed in these steps may represent genuine structural differences between the genomes, but they thought it was bad assembly, and they were likely right for the bulk of it.
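For anyone who wants to check the arithmetic, here is a quick sketch using only the figures quoted above. The choice of numerator and denominator for each percentage is my reading of the quoted sentences, not something the note spells out, so treat the pairings as assumptions.

```python
# All figures in Gb, taken from the quoted supplemental-note sentences.
total_seq, total_span = 2.73, 3.109        # all chimpanzee scaffolds
aligned_seq, aligned_span = 2.70, 3.077    # scaffolds with significant alignments
anchored_span, anchored_seq = 2.74, 2.41   # after the clean-up steps

def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)

# Assumed pairing: aligned sequence over the total span of the assembly.
aligned_vs_span = pct(aligned_seq, total_span)
# Assumed pairing: anchored contig length over total scaffold sequence.
anchored_vs_seq = pct(anchored_seq, total_seq)
# Sequence dropped by the clean-up steps, and its share of the assembly.
removed_seq = round(aligned_seq - anchored_seq, 2)
removed_vs_seq = pct(removed_seq, total_seq)

print(aligned_vs_span, anchored_vs_seq, removed_seq, removed_vs_seq)
```

Under those pairings the numbers come out at about 87%, 88%, 0.29 Gb and a bit under 11%, which matches the 87%, 88%, roughly 0.3 Gb and roughly 10% figures discussed above.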

I honestly don’t know what the 94% figure means exactly. My guess is that it expresses the fraction of the genome that was covered by aligned contigs, but it clearly doesn’t represent the total number of bases actually aligned, since that number is given above. Note that elsewhere in that supplemental note, the “base coverage” (not “coverage”) of coding regions on chromosome 21, which they used to check their assembly and alignment, is given as 89%: “The WGS chromosome 21 sequence yielded a mean base coverage of the human coding regions of 89.3%.” Since coding regions are likely to be better covered than elsewhere in the genome, the total base coverage is likely to be lower than this estimate.