Hi @glipsnort
a) Current estimates
Thanks for your response. The “maximum” estimate I made of similarity between the human and chimpanzee genomes is heavily affected by structural variation (which I call “Paralogs in the axtnet alignment” in the tables I presented above), so I am very happy to have a careful think about whether this number may be inaccurate.
The main chimp genome paper did not quantify differences due to structural variation. These were examined in 2005 in a separate paper here, where the authors concluded: “Nevertheless, base per base, large segmental duplication events have had a greater impact (2.7%) in altering the genomic landscape of these two species than single-base-pair substitution (1.2%).” So the challenge for me is to explain why this figure now seems to be 5% rather than 2.7%. I think this is because improvements in both the human and chimpanzee genome assemblies have led to a better assembly of copy number variants. It may also be because the 2005 paper on this issue only looked at larger CNVs.
Yes, that is my assumption also.
I assume it used the primary sequence and did not include the alternatives. I think this would be a reasonable approach.
I think you are asking me to come up with: (1) an estimate of the error rate in the human genome assembly that is due to misplaced reads, (2) the corresponding figure for the chimpanzee assembly, and (3) the error rate in the alignments of the two that is due to incorrect placement of paralogs. That is a pretty big task, and I don’t know if it is even possible. All I can do is give figures for what the most up-to-date assemblies and alignments are saying, and give caveats about revisions that could come in the future as we get more and more accurate assemblies and alignments.
As I mentioned above, I had wondered if the estimated difference between humans and chimps due to CNVs would go down as the chimpanzee genome improved, but in fact it has kept increasing.
If it is number (3) that is your main concern, I don’t think this would have a major effect on the overall figures. Misplaced one-to-one reciprocal best alignments are most likely to occur where the repeat copies are highly similar, so the only effect of a misplaced one-to-one reciprocal best alignment will be a slight over-estimate of the number of SNPs or indels within that alignment. It won’t make a big difference to the estimate of difference due to CNVs, as far as I can see.
b) What was known about the chimpanzee genome in 2008
Earlier you commented to me about my analysis of the 2005 chimpanzee genome paper:
Now you have added:
Thanks for pointing me to the “Creation of Chimp AGP Files” section of the supplement. I had not looked at this very closely before, as this part of the study was not mentioned in the main paper itself. I had assumed that the main purpose of this particular supplemental section was to order the chimpanzee scaffolds on to the human assembly, so that the chimpanzee assembly could be pseudo-chromosomal.
As you say, this supplement section gives a rather different picture from the prominent claim in the main text: “The draft genome assembly…covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases”. Given the prominence of the latter claim, I am not surprised that when I wrote about this paper in 2008 I thought that the 2005 paper presented a higher quality chimpanzee genome assembly than it actually does.
I am not sure I can agree with you on this point. The authors responsible for this section do not say that these parts of the assembly were wrong. Their aim seems to be to come up with a higher-level scaffolding of the chimpanzee genome using one-to-one reciprocal best alignments with the human genome. The parts of the chimpanzee assembly that they don’t include in this are copy number variants, scaffolds that align partly to one human chromosome and partly to another, and a few other categories. The authors don’t seem to be suggesting that these were misassembled in the chimpanzee assembly - it is just that they don’t fit in well with an approach to super-scaffolding that assumes that the human and chimpanzee genomes contain no structural rearrangements.
Did the authors of this part of the study tell you privately that they thought that the parts of the assembly that they discarded from their alignment with the human genome were wrong?
Even if they did think that these parts of the assembly were wrong, their grounds for thinking so were that these parts did not align to the human genome as well as they expected. If we follow that reasoning when trying to estimate a total percentage similarity between the human and chimpanzee genomes, we introduce a degree of circularity.
I have just noticed this at the bottom of page six of the supplement: “We estimate the genome coverage to be about 94%, based on comparison to 12 finished CHORI-251 BAC clones. These clones collectively comprise a total of 1,265,617 bases of sequence. Table S11 shows detailed information of these comparisons. Specifically, ARACHNE covers 1,186,774 bases, or 93.8% of the clones, while PCAP covers 1,189,836 bases, or 94.0% of the clones.”
So I guess that this is where the 94% figure came from. It is a tiny sample size. I have to say, I find your 87% figure more convincing as a genome-wide estimate.
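For anyone who wants to check the arithmetic in the quoted passage, here is a minimal sketch that re-derives the two coverage percentages from the base counts given in the supplement. It also gives a rough sense of how small the sample is. The ~3.0 Gb genome size is my own round-number assumption, not a figure from the supplement, and the variable and function names are just illustrative.

```python
# Base counts quoted from the supplement (12 finished CHORI-251 BAC clones).
CLONE_BASES = 1_265_617
COVERED_BASES = {"ARACHNE": 1_186_774, "PCAP": 1_189_836}

# ASSUMPTION (not from the supplement): a round ~3.0 Gb genome size,
# used only to illustrate how small the BAC sample is.
ASSUMED_GENOME_BASES = 3_000_000_000


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Coverage of the finished clones by each assembly, to one decimal place.
coverage = {
    name: round(percent(bases, CLONE_BASES), 1)
    for name, bases in COVERED_BASES.items()
}

# Share of the (assumed) genome represented by the 12 clones.
sampled_percent = percent(CLONE_BASES, ASSUMED_GENOME_BASES)

for name, pct in coverage.items():
    print(f"{name}: {pct}% of the clone sequence covered")
print(f"Clones sample roughly {sampled_percent:.3f}% of the assumed genome")
```

Under that assumption the 12 clones amount to a few hundredths of one percent of the genome, which is why I would be cautious about treating 94% as a genome-wide figure.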