Human Chimp Genome Similarity

RichardBuggs · May 14, 2018, 9:05am

Hi @glipsnort, if you have any further concerns about my analyses of the current data, please do let me know.

Meanwhile, I am just trying to trace something down from your earlier critique of my 2008 statements about the 2005 chimpanzee genome paper.

Please could you let me know where you got the 87% and 10% figures from? I have had another look at the 2005 chimpanzee genome paper, which says on page 70:

“The draft genome assembly—generated from ~3.6-fold sequence redundancy of the autosomes and ~1.8-fold redundancy of both sex chromosomes— covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases.”

This is where I got my 94% figure from (Note: it is for assembled reads). Where do you get an 87% figure? Is it somewhere in the supplements or a follow-up paper? And where do the authors suggest that 10% of the assembly is wrong?

I am very keen to trace the source of your numbers, because I would like to know if I could have known them in 2008.

many thanks,

Richard

glipsnort · May 15, 2018, 10:18am

Hi @RichardBuggs. My main concern is something I already asked about, albeit somewhat tersely. As I recall, the main discrepancy between your current estimates of identity and the numbers in the chimp genome paper is the roughly 5% of the genome that looked like structural variation. I assume these are human regions that align best to a location in the chimpanzee genome whose best alignment is somewhere in the human genome other than the starting point.

Something like 5% to 10% of the human genome is known to show copy number variation; these regions are notoriously difficult to call correctly in individual genomes and are a known source of incorrect variant calling (e.g. this paper). Build 38 of the human reference genome contains substantial regions with alternate assemblies to describe some of this variation (I think amounting to 2% of the genome, but I could be mistaken). How did the alignment procedure you are using handle these regions? More broadly, to set an upper limit on the identity of the two genomes, you have to either allow for errors in assigning the correct location of reads from variable regions, or show that the error is negligible. Given the state of the genome assemblies and the alignment procedure used, what is your estimate of this error rate, and how did you make that estimate?

My numbers come from the same supplemental note. The 87% comes from this: “The 37,931 chimpanzee scaffolds comprise 2.73 Gb of sequence and span 3.109 Gb of the genome. Of these, a total of 33,180 (2.70 Gb of sequence spanning 3.077 Gb) scaffolds had significant alignments to the human genome.” That is the summary statement of the amount of chimpanzee sequence that aligned to the human reference. The 10% of the assembly that they thought was wrong was the fraction removed in the steps described in the following paragraphs, which lead to this summary statement: “The total anchored sequence after these steps dropped to 2.74 Gb (2.41 Gb of actual contig length), or 88% of the total chimpanzee sequence.” Some of the 0.3 Gb that was removed in these steps may represent genuine structural differences between the genomes, but they thought it was bad assembly, and they were likely right for the bulk of it.
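As a rough arithmetic check on the figures quoted from the supplement, the sketch below computes a few candidate ratios. Which denominator the authors actually used is not stated in the quotes, so dividing by the 3.109 Gb total span is an assumption here, and the variable names are purely illustrative.

```python
# Figures quoted from the supplemental note, all in Gb.
total_sequence = 2.73     # sequence in all 37,931 scaffolds
total_span = 3.109        # genome span of all scaffolds
aligned_sequence = 2.70   # sequence in scaffolds with significant human alignments
anchored_span = 2.74      # anchored sequence after the filtering steps
anchored_contigs = 2.41   # actual contig length after the filtering steps


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Assumption: both percentages are taken against the 3.109 Gb total span.
aligned_pct = pct(aligned_sequence, total_span)
anchored_pct = pct(anchored_span, total_span)

# Sequence dropped by the filtering steps, in Gb and as a share of all scaffold sequence.
removed_gb = round(aligned_sequence - anchored_contigs, 2)
removed_pct = pct(removed_gb, total_sequence)

print(aligned_pct, anchored_pct, removed_gb, removed_pct)
```

Under that assumption the first ratio lands close to 87% and the second close to the quoted 88%, while the removed sequence is about 0.29 Gb, or roughly a tenth of the scaffold sequence. Other pairings of the quoted numbers give similar values, so this does not settle which ratio the authors intended.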

I honestly don’t know what the 94% figure means exactly. My guess is that it expresses the fraction of the genome that was covered by aligned contigs, but it clearly doesn’t represent the total number of bases actually aligned, since that number is given above. Note that elsewhere in that supplemental note, the “base coverage” (not “coverage”) of coding regions on chromosome 21, which they used to check their assembly and alignment, is given as 89%: “The WGS chromosome 21 sequence yielded a mean base coverage of the human coding regions of 89.3%.” Since coding regions are likely to be better covered than elsewhere in the genome, the total base coverage is likely to be lower than this estimate.

RichardBuggs · May 16, 2018, 4:03pm

Hi @glipsnort

a) Current estimates

Thanks for your response. The “maximum” estimate I made of similarity between the human and chimpanzee genomes is heavily affected by structural variation (which I call “Paralogs in the axtnet alignment” in the tables I presented above), so I am very happy to have a careful think about whether this number may be inaccurate.

The main chimp genome paper did not quantify differences due to structural variation. These were examined in 2005 in a separate paper here, where the authors concluded: “Nevertheless, base per base, large segmental duplication events have had a greater impact (2.7%) in altering the genomic landscape of these two species than single-base-pair substitution (1.2%).” So the challenge for me is to explain why this figure now seems to be 5% rather than 2.7%. I think this is because improvements in both the human and chimpanzee genome assemblies have led to a better assembly of copy number variants. It may also be because the 2005 paper on this issue only looked at larger CNVs.

Yes, that is my assumption also.

I assume it used the primary sequence and did not include the alternatives. I think this would be a reasonable approach.

I think you are asking me to come up with: (1) an estimate of the error rate in the human genome assembly that is due to mis-placed reads, and (2) the corresponding figure for the chimpanzee assembly, and (3) the error rate in the alignments of the two that is due to incorrect placement of paralogs. That is a pretty big task, and I don’t know if it is even possible. All I can do is give figures for what the most up-to-date assemblies and alignments are saying, and give caveats about revisions that could come in the future as we get more and more accurate assemblies and alignments.

As I mentioned above, I had wondered if the estimated difference between humans and chimps due to CNVs would go down as the chimpanzee genome improved, but in fact it has kept increasing.

If it is number (3) that is your main concern, I don’t think this would have a major effect on the overall figures. Misplaced one-to-one reciprocal best alignments are most likely to occur where the repeat copies are highly similar, so the only effect of a misplaced one-to-one reciprocal best alignment will be a slight over-estimate of the number of SNPs or indels within that alignment. It won’t make a big difference to the estimate of difference due to CNVs, as far as I can see.

b) What was known about the chimpanzee genome in 2008

Earlier you commented to me about my analysis of the 2005 chimpanzee genome paper:

Now you have added:

Thanks for pointing me to the “Creation of Chimp AGP Files” section of the supplement. I had not looked at this very closely before, as this part of the study was not mentioned in the main paper itself. I had assumed that the main purpose of this particular supplemental section was to order the chimpanzee scaffolds on to the human assembly, so that the chimpanzee assembly could be pseudo-chromosomal.

As you say, this supplement section gives a rather different picture from the prominent claim in the main text: “The draft genome assembly…covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases”. Given the prominence of the latter claim, I am not surprised that when I wrote about this paper in 2008 I thought that the 2005 paper presented a higher quality chimpanzee genome assembly than it actually does.

I am not sure I can agree with you on this point. The authors responsible for this section do not say that these parts of the assembly were wrong. Their aim seems to be to come up with a higher-level scaffolding of the chimpanzee genome using one-to-one reciprocal best alignments with the human genome. The parts of the chimpanzee assembly that they don’t include in this are copy number variants, scaffolds that align partly to one human chromosome and partly to another, and a few other categories. The authors don’t seem to be suggesting that these were misassembled in the chimpanzee assembly - it is just that they don’t fit in well with an approach to super-scaffolding that assumes that the human and chimpanzee genomes contain no structural rearrangements.

Did the authors of this part of the study tell you privately that they thought that the parts of the assembly that they discarded from their alignment with the human genome were wrong?

Even if they did think that these parts of the assembly were wrong, their grounds for thinking so were that these parts did not align to the human genome as well as they expected. If we follow that reasoning when trying to estimate a total percentage similarity between the human and chimpanzee genomes, we introduce a degree of circularity.

I have just noticed this at the bottom of page six of the supplement: “We estimate the genome coverage to be about 94%, based on comparison to 12 finished CHORI-251 BAC clones. These clones collectively comprise a total of 1,265,617 bases of sequence. Table S11 shows detailed information of these comparisons. Specifically, ARACHNE covers 1,186,774 bases, or 93.8% of the clones, while PCAP covers 1,189,836 bases, or 94.0% of the clones.”

So I guess that this is where the 94% figure came from. It is a tiny sample size. I have to say, I find your 87% figure more convincing as a genome-wide estimate.
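For anyone who wants to check the BAC-clone numbers, here is a minimal sketch that recomputes the two coverage percentages from the base counts quoted above. The sample-share line assumes a genome size of 3.109 Gb (the scaffold span quoted earlier in this thread), which is only a stand-in for the true genome size.

```python
# Base counts quoted from page six of the supplement.
clone_bases = 1_265_617      # total bases in the 12 finished BAC clones
arachne_covered = 1_186_774  # clone bases covered by the ARACHNE assembly
pcap_covered = 1_189_836     # clone bases covered by the PCAP assembly

arachne_pct = round(100 * arachne_covered / clone_bases, 1)
pcap_pct = round(100 * pcap_covered / clone_bases, 1)

# Assumption: genome size taken as the 3.109 Gb scaffold span quoted earlier.
assumed_genome_bases = 3_109_000_000
sample_share_pct = 100 * clone_bases / assumed_genome_bases

print(arachne_pct, pcap_pct, round(sample_share_pct, 3))
```

The two percentages reproduce the quoted 93.8% and 94.0%, and under the assumed genome size the twelve clones amount to about 0.04% of the genome.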

T_aquaticus · May 16, 2018, 4:04pm

It is also difficult to use chromosome walking (i.e. Sanger sequencing) in BAC clones because primers for repeat regions will bind all over the place instead of binding to a specific and unique sequence. Although I don’t have direct experience with total genome sequencing I do have extensive experience with PCR and sequencing of small stretches, so I think I have a handle on the difficulties.

I would also think that there is a lot of variation in the human population in regions with a lot of repeats for the very same reasons you have outlined. Would it be expected that the reported similarities between human genomes go down once these repeat regions are considered?

BAC clones should solve these problems as they are sequenced over time.

Chimpanzee sequence is much more similar to human sequence than other species as expected from common ancestry and evolution, not morphology. You could have very, very different genomes and still have nearly identical morphology. To use an analogy, the Google Chrome web browser looks almost identical on PC and Mac, but the underlying machine code is very different. The same applies to DNA. In fact, only a tiny, tiny portion of the overall genome has any effect on morphology, yet the whole genome is very similar.

RichardBuggs · May 21, 2018, 3:55pm

Hi @glipsnort, I am very much hoping that you will have time to respond to my most recent post above. I am not making this comment to hurry you, but simply because I don’t want the topic to be automatically closed before you have a chance to respond.
best wishes
Richard

Christy · May 21, 2018, 6:37pm

If it does end up automatically closed, message me or @jpm and one of us can unlock it for you again.

RichardBuggs · May 22, 2018, 5:34pm

Thanks @Christy!

Christy · May 28, 2018, 11:35pm

This topic was automatically closed 6 days after the last reply. New replies are no longer allowed.

RichardBuggs · June 4, 2018, 8:41pm

Hi all,

Until now, apart from the 2005 Chimpanzee genome paper, I have mainly been basing my argument on simple analyses of the (more recent) human-chimp genome alignments provided by UCSC, made using the LASTZ software. I have no reason to think that these are highly inaccurate, but in case any of you had any doubts, I would draw your attention to a paper published this year that aligned the human and chimpanzee genomes using a different piece of software (called MUMmer).

The authors aligned PanTro4 with Hg38 using MUMmer. They write (emphasis added):

“MUMmer had 2.782 Gb of the sequence in mutual best alignments, where each location in the chimp was aligned to its best hit in human and vice versa, with an average identity of 98.07%. The 1.93% nucleotide-level divergence found here is higher than the 1.23% reported in the original chimpanzee genome paper [25]. Our higher divergence is likely due to two factors: first, the 2005 report was based on 2.4 Gb of aligned sequence from older versions of both genomes, while ours is based on 2.782 Gb (16% more sequence) aligned between the current, more-complete versions of both genomes. Second, the original report used different methods, and may have counted fewer small indels than were counted in our alignments. Approximately 306 Mb (9.91%) of the human sequence did not align to the chimpanzee sequence, while 138 Mb (4.15%) of the chimpanzee sequence did not align to human. We detected 390 Mb in alignments where multiple sequences from chimpanzee aligned to the same location in human sequence and thus only one was chosen as the best alignment based on alignment identity.”
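The figures in that passage can be cross-checked with simple arithmetic. The sketch below is only a consistency check on the quoted numbers; the "implied total" values are back-calculated from the quoted percentages and are not figures stated by the authors.

```python
# Values quoted from the MUMmer paper.
identity_pct = 98.07        # average identity in mutual best alignments
aligned_2005_gb = 2.4       # aligned sequence in the 2005 report
aligned_mummer_gb = 2.782   # aligned sequence in the MUMmer comparison

divergence_pct = round(100 - identity_pct, 2)
extra_gb = round(aligned_mummer_gb - aligned_2005_gb, 3)
extra_pct = round(100 * extra_gb / aligned_2005_gb)

# Back-calculated totals implied by "306 Mb (9.91%)" and "138 Mb (4.15%)".
implied_human_mb = round(306 / 0.0991)
implied_chimp_mb = round(138 / 0.0415)

print(divergence_pct, extra_gb, extra_pct, implied_human_mb, implied_chimp_mb)
```

The divergence (1.93%) and the "16% more sequence" both follow from the quoted values, and the additional aligned sequence comes to 0.382 Gb.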

T_aquaticus · June 4, 2018, 9:11pm

So if I am reading that correctly, the previous comparison had 2.4 Gb aligned, which was roughly 0.4 Gb less than the subsequent alignment. Previously, you seemed to indicate that DNA didn’t align because it lacked an ortholog in the human genome. However, they were able to align roughly 0.4 Gb of sequence that they were not able to previously, and the overall percentage was pretty much the same (1.23% vs. 1.93%).

So how does this fit into your larger analysis of these genomes?

RichardBuggs · June 5, 2018, 5:41pm

Hi @T_aquaticus, if you want to compare these results from MUMmer directly with my previous analyses, I would refer you to my post #40 above, where I give data for the PanTro4 chimp versus Hg38 human genome assemblies.

I have already discussed in some detail in posts above why the more recent alignments are giving a greater length of alignment than the 2005 chimpanzee paper, and what we can and can’t infer from this.

RichardBuggs · June 5, 2018, 6:08pm

Hi @DennisVenema thanks for raising this topic several weeks ago. This has been a very interesting discussion, and I am especially grateful to @glipsnort for helping me to understand the shortcomings of the 2005 paper and assembly, which I had not fully appreciated before.

I hope you have had a chance to engage with the data I have presented and perhaps re-evaluate your own understanding of the similarity of the human and chimpanzee genomes.

I notice that on page 32 of Adam and the Genome you wrote of humans and chimpanzees “our entire genomes are either around 95 per cent or 98 per cent identical depending on how one counts the effects of deletions of small blocks of DNA”.

As I think the discussion above shows, this claim is wrong. Our “entire genomes” have not been shown to be 95-98% identical to chimpanzee genomes. As my post #40 above shows: about 5% of the human genome has not yet been assembled, 4% of the human genome shows no alignment to the most recent chimpanzee assemblies, 5% is different due to copy number variation, over 1% is different due to SNPs. In fact, we can only be totally sure at present that 84% of the human genome is identical to the chimpanzee genome.
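To make the tally behind that figure explicit, here is a minimal sketch that adds up the rounded percentages listed in the paragraph above. The category labels are paraphrases, and "over 1%" is entered as exactly 1.0, so the result is an upper bound rather than a reproduction of the unrounded figures in post #40.

```python
# Rounded percentages of the human genome, as listed in the paragraph above.
differences_pct = {
    "not yet assembled": 5.0,
    "no alignment to chimpanzee": 4.0,
    "copy number variation": 5.0,
    "SNPs (stated as 'over 1%')": 1.0,  # lower bound, so the total is a lower bound too
}

total_excluded_pct = sum(differences_pct.values())
upper_bound_identical_pct = 100 - total_excluded_pct

print(total_excluded_pct, upper_bound_identical_pct)
```

With these rounded inputs the excluded categories sum to at least 15%, leaving at most 85%. The 84% quoted above is consistent with that if the unrounded values sum to about 16%.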

In response to your original request, I have freely admitted that the prediction that I made ten years ago about human-chimp similarity was wrong. I hope that you are also willing to admit that the claim that you made in Adam and the Genome on this topic is also wrong!

T_aquaticus · June 5, 2018, 7:09pm

Is this due to a lack of quality in the sequencing or a real difference in sequence?

But why would the other 16% be different from the 84% that has already been compared? You again seem to be under the false impression that a lack of alignment means dissimilar.

Bill_II · June 5, 2018, 8:46pm

But the paper you just linked to says

Who is right?

RichardBuggs · June 9, 2018, 6:02pm

Yes it says that in the abstract. But you need to look at the data presented in the paper. Does it support that statement?

RichardBuggs · June 9, 2018, 6:06pm

We have discussed this question quite extensively above.

Again, we have discussed this extensively above.

Bill_II · June 9, 2018, 6:26pm

I have read the paper. It says this also:

I understand that the purpose of the paper is just to show the new version of MUMmer is an improvement over the older version. The chimp/human comparison was done to illustrate the improvements. But to a layman it is clear that chimp and human DNA are almost identical, much closer than the 84% you appear to accept.

RichardBuggs · June 12, 2018, 2:29pm

Hi Bill,
Good to hear that you have read the paper. How do you reconcile the parts you have quoted with their statement: “Approximately 306 Mb (9.91%) of the human sequence did not align to the chimpanzee sequence”?
best wishes,
Richard

Bill_II · June 12, 2018, 3:01pm

@RichardBuggs As a layman I would say that where an alignment between chimp and human DNA can be determined, the DNA is on average 98% identical. Where there is no alignment I would expect no comparison was possible. So you can say it is 98% identical or 90% identical and basically be saying the same thing. Neither of these numbers is anywhere near your 84%.

The important point that has not been addressed, as far as I can tell, is this: what is the significance of this level of equivalence?