I would agree that repetitive DNA would be more susceptible to indels through genetic recombination, but I also think #1 explains #2. Repetitive sections are hard to sequence, and this would be true in both the human and chimp genomes. Therefore, there is a very good chance that newly added chimp sequence will align with regions that have yet to be assembled in the human genome.
I would also be curious what your results would be when comparing the chimp genome to the genomes of other apes, and also the percentage similarity when comparing two human genomes.
It certainly seems obvious on the face of it that humans are more closely related to chimps than to any other living animal based on obvious morphological similarities alone. That the close relationship should be found in the DNA as well should not be surprising.
No one should have to defend the position that humans aren’t animals or that we haven’t been evolving just as every other living life form has over the eons. Accepting this should not be a bridge too far for a Christian, and if it is then clearly the theology needs adjusting.
Perhaps the solution is to attribute natural selection to God’s plan. Or perhaps the concept of God itself could be dialed down from that of creator of everything whatsoever to something inside ourselves which makes our oh-so-different human experience possible.
Hi T, yes repetitive regions are more likely to have mutations due to unequal crossovers during recombination. The main reason why they are hard to assemble is that when a particular sequence is repeated many times in tandem (one after another after another…) it is hard to know how many times that repeat occurs. We don’t know which copy of the repeat our DNA reads come from. It is a bit like the pieces of a jigsaw puzzle (to use the analogy you used earlier) that are all from a blue sky. They are all exactly the same so it is very hard to place them, and unlike a jigsaw puzzle, they don’t have unique shapes that allow us to place them uniquely in the end. The only way to properly assemble them is to use very long reads (such as PacBio, which was used in PanTro5 and PanTro6) or methods that pair up two reads over long distances (mate pair libraries), but even today our methods are not always good enough. Obviously a lot of effort has gone into this for the human genome, and somewhat less for the chimpanzee, though what has been done on the chimpanzee is pretty impressive. A lot of repetitive DNA has been assembled in both genomes.
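To make the blue-sky point concrete, here is a minimal Python sketch under toy assumptions: the repeat unit, the flanking sequences and the read lengths are all invented for illustration, and reads are treated as error-free substrings. It shows two regions that differ only in copy number giving exactly the same set of short reads, while reads long enough to span the whole repeat array tell them apart.

```python
def make_region(copies, unit="ACGTTG", left="GGATCCA", right="TTAGCGC"):
    """Build a toy region: unique flank + tandem repeat + unique flank."""
    return left + unit * copies + right


def distinct_reads(sequence, read_len):
    """Return every distinct error-free read of length read_len."""
    return {sequence[i:i + read_len] for i in range(len(sequence) - read_len + 1)}


region_10 = make_region(10)  # 10 copies of the repeat unit (60 bp array)
region_15 = make_region(15)  # 15 copies of the same unit (90 bp array)

# Short reads (10 bp) never span the whole repeat array.
short_reads_match = distinct_reads(region_10, 10) == distinct_reads(region_15, 10)

# Long reads (70 bp) can span the 60 bp array plus part of both flanks.
long_reads_match = distinct_reads(region_10, 70) == distinct_reads(region_15, 70)

print(short_reads_match)  # True: copy number is invisible to the short reads
print(long_reads_match)   # False: spanning reads distinguish the two regions
```

The sketch ignores read depth and read pairing, both of which give real assemblers some extra information about copy number.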
It is worth noting, however, that it is often the number of copies of a repeat that we don’t know, not the sequence of the repeat itself. So if, for example, we are still unsure of how many times a particular repeat occurs in the chimpanzee genome, we may still know the sequence of the repeat and be able to align that to the human genome. In the .net.axt alignments (see post 35 above) the single sequence we have from the chimpanzee would be able to align multiple times to the human genome, so in my analyses in post 35 and post 40, these would show up as “Paralogs in axtnet alignment”, not as “Unaligned”.
Another reason why a region of the genome may be hard to assemble is because it is highly heterozygous in the individual being sequenced, or for some other reason it is highly variable among allelic copies in the material being sequenced. Such regions are likely to be rapidly evolving, or under long term balancing selection.
For these reasons (and the point I have made previously about the comparison of PanTro4 and PanTro5 alignments with Hg38), I would not be too optimistic that newly added chimp sequence will align with regions that have yet to be assembled in the human genome.
I haven’t looked at that, but looking at UCSC alignments of other species to the human genome, the chimpanzee has a much higher percent identity than any of the other species, as one would expect from morphology (as @MarkD has noted). The overall percentage of human vs other species genomes that are identical is in the 80s for chimpanzee and bonobo, in the 70s for the macaque, in the 40s for tarsier, in the 30s for cat, dog and cow, in the 20s for mouse and rat.
Hi @glipsnort, if you have any further concerns about my analyses of the current data, please do let me know.
Meanwhile, I am just trying to trace something down from your earlier critique of my 2008 statements about the 2005 chimpanzee genome paper.
Please could you let me know where you got the 87% and 10% figures from? I have had another look at the 2005 chimpanzee genome paper, which says on page 70:
“The draft genome assembly—generated from ~3.6-fold sequence redundancy of the autosomes and ~1.8-fold redundancy of both sex chromosomes— covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases.”
This is where I got my 94% figure from (Note: it is for assembled reads). Where do you get an 87% figure? Is it somewhere in the supplements or a follow-up paper? And where do the authors suggest that 10% of the assembly is wrong?
I am very keen to trace the source of your numbers, because I would like to know if I could have known them in 2008.
Hi @RichardBuggs. My main concern is something I already asked about, albeit somewhat tersely. As I recall, the main discrepancy between your current estimates of identity and the numbers in the chimp genome paper is the roughly 5% of the genome that looked like structural variation. I assume these are human regions that align best to a location in the chimpanzee genome whose best alignment is somewhere in the human genome other than the starting point.
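As a rough illustration of what a reciprocal best alignment means, the Python sketch below uses invented region names and alignment scores; it is not the LASTZ or UCSC chain/net procedure, only the bare idea. A human region counts as reciprocal best when its top-scoring chimpanzee region points back to it, and as non-reciprocal (the case described above) when that chimpanzee region aligns better somewhere else in the human genome.

```python
def best_hits(scores):
    """Map each query region to its highest-scoring target region."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def classify(human_to_chimp, chimp_to_human):
    """Label each human region as reciprocal best or non-reciprocal."""
    human_best = best_hits(human_to_chimp)
    chimp_best = best_hits(chimp_to_human)
    labels = {}
    for human_region, chimp_region in human_best.items():
        if chimp_best.get(chimp_region) == human_region:
            labels[human_region] = "reciprocal best"
        else:
            labels[human_region] = "non-reciprocal"
    return labels


# Hypothetical scores: h2 is a human duplicate of h1 with no second chimp copy.
human_to_chimp = {("h1", "c1"): 95, ("h2", "c1"): 90, ("h2", "c2"): 60}
chimp_to_human = {("c1", "h1"): 95, ("c1", "h2"): 90, ("c2", "h2"): 60}

labels = classify(human_to_chimp, chimp_to_human)
print(labels)
```

Here h2 aligns best to c1, but c1 aligns best back to h1, so h2 falls outside the one-to-one set.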
Something like 5% to 10% of the human genome is known to show copy number variation; these regions are notoriously difficult to call correctly in individual genomes and are a known source of incorrect variant calling (e.g. this paper). Build 38 of the human reference genome contains substantial regions with alternate assemblies to describe some of this variation (I think amounting to 2% of the genome, but I could be mistaken). How did the alignment procedure you are using handle these regions? More broadly, to set an upper limit on the identity of the two genomes, you have to either allow for errors in assigning the correct location of reads from variable regions, or show that the error is negligible. Given the state of the genome assemblies and the alignment procedure used, what is your estimate of this error rate, and how did you make that estimate?
My numbers come from the same supplemental note. The 87% comes from this: “The 37,931 chimpanzee scaffolds comprise 2.73 Gb of sequence and span 3.109 Gb of the genome. Of these, a total of 33,180 (2.70 Gb of sequence spanning 3.077 Gb) scaffolds had significant alignments to the human genome.” That is the summary statement of the amount of chimpanzee sequence that aligned to the human reference. The 10% of the assembly that they thought was wrong was the fraction removed in the steps described in the following paragraphs, which lead to this summary statement: “The total anchored sequence after these steps dropped to 2.74 Gb (2.41 Gb of actual contig length), or 88% of the total chimpanzee sequence.” Some of that .3 Gb that was removed in these steps may represent genuine structural differences between the genomes, but they thought it was bad assembly, and they were likely right for the bulk of it.
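For anyone who wants to check the arithmetic, here is a small Python sketch using only the figures quoted from the supplement. Which ratio the 87% corresponds to is an assumption on my part: aligned sequence divided by total scaffold span is one ratio that reproduces it.

```python
# Figures quoted from the supplemental note, in gigabases (Gb).
scaffold_sequence_gb = 2.73   # sequence in all 37,931 scaffolds
scaffold_span_gb = 3.109      # genome span of all scaffolds
aligned_sequence_gb = 2.70    # sequence in scaffolds aligning to human
anchored_contig_gb = 2.41     # contig length left after the filtering steps

# Assumed derivation of the 87% figure: aligned sequence over total span.
aligned_fraction = aligned_sequence_gb / scaffold_span_gb

# The supplement's 88%: anchored contig length over total scaffold sequence.
anchored_fraction = anchored_contig_gb / scaffold_sequence_gb

# Sequence dropped by the filtering steps, and its share of the aligned sequence.
removed_gb = aligned_sequence_gb - anchored_contig_gb
removed_fraction = removed_gb / aligned_sequence_gb

print(f"aligned:  {aligned_fraction:.1%}")
print(f"anchored: {anchored_fraction:.1%}")
print(f"removed:  {removed_gb:.2f} Gb ({removed_fraction:.1%} of aligned)")
```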
I honestly don’t know what the 94% figure means exactly. My guess is that it expresses the fraction of the genome that was covered by aligned contigs, but it clearly doesn’t represent the total number of bases actually aligned, since that number is given above. Note that elsewhere in that supplemental note, the “base coverage” (not “coverage”) of coding regions on chromosome 21, which they used to check their assembly and alignment, is given as 89%: “The WGS chromosome 21 sequence yielded a mean base coverage of the human coding regions of 89.3%.” Since coding regions are likely to be better covered than elsewhere in the genome, the total base coverage is likely to be lower than this estimate.
Thanks for your response. The “maximum” estimate I made of similarity between the human and chimpanzee genomes is heavily affected by structural variation (which I call “Paralogs in the axtnet alignment” in the tables I presented above), so I am very happy to have a careful think about whether this number may be inaccurate.
The main chimp genome paper did not quantify differences due to structural variation. These were examined in 2005 in a separate paper here, where the authors concluded: “Nevertheless, base per base, large segmental duplication events have had a greater impact (2.7%) in altering the genomic landscape of these two species than single-base-pair substitution (1.2%).” So the challenge for me is to explain why this figure now seems to be 5% rather than 2.7%. I think this is because improvements in both the human and chimpanzee genome assemblies have led to a better assembly of copy number variants. It may also be because the 2005 paper on this issue only looked at larger CNVs.
Yes, that is my assumption also.
I assume it used the primary sequence and did not include the alternatives. I think this would be a reasonable approach.
I think you are asking me to come up with: (1) an estimate of the error rate in the human genome assembly that is due to mis-placed reads, and (2) the corresponding figure for the chimpanzee assembly, and (3) the error rate in the alignments of the two that is due to incorrect placement of paralogs. That is a pretty big task, and I don’t know if it is even possible. All I can do is give figures for what the most up to date assemblies and alignments are saying, and give caveats about revisions that could come in the future as we get more and more accurate assemblies and alignments.
As I mentioned above, I had wondered if the estimated difference between humans and chimps due to CNVs would go down as the chimpanzee genome improved, but in fact it has kept increasing.
If it is number (3) that is your main concern, I don’t think this would have a major effect on the overall figures. Misplaced one-to-one reciprocal best alignments are most likely to occur where the repeat copies are highly similar, so the only effect of a misplaced one-to-one reciprocal best alignment will be a slight over-estimate of the number of SNPs or indels within that alignment. It won’t make a big difference to the estimate of difference due to CNVs, as far as I can see.
b) What was known about the chimpanzee genome in 2008
Earlier you commented to me about my analysis of the 2005 chimpanzee genome paper:
Now you have added:
Thanks for pointing me to the “Creation of Chimp AGP Files” section of the supplement. I had not looked at this very closely before, as this part of the study was not mentioned in the main paper itself. I had assumed that the main purpose of this particular supplemental section was to order the chimpanzee scaffolds on to the human assembly, so that the chimpanzee assembly could be pseudo-chromosomal.
As you say, this supplement section gives a rather different picture from the prominent claim in the main text: “The draft genome assembly…covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases”. Given the prominence of the latter claim, I am not surprised that when I wrote about this paper in 2008 I thought that the 2005 paper presented a higher quality chimpanzee genome assembly than it actually does.
I am not sure I can agree with you on this point. The authors responsible for this section do not say that these parts of the assembly were wrong. Their aim seems to be to come up with a higher-level scaffolding of the chimpanzee genome using one-to-one reciprocal best alignments with the human genome. The parts of the chimpanzee assembly that they don’t include in this are copy number variants, scaffolds that align partly to one human chromosome and partly to another, and a few other categories. The authors don’t seem to be suggesting that these were misassembled in the chimpanzee assembly - it is just that they don’t fit in well with an approach to super-scaffolding that assumes that the human and chimpanzee genomes contain no structural rearrangements.
Did the authors of this part of the study tell you privately that they thought that the parts of the assembly that they discarded from their alignment with the human genome were wrong?
Even if they did think that these parts of the assembly were wrong, their grounds for thinking so were that these parts did not align to the human genome as well as they expected. If we follow that reasoning when trying to estimate a total percentage similarity between the human and chimpanzee genomes, we introduce a degree of circularity.
I have just noticed this at the bottom of page six of the supplement: “We estimate the genome coverage to be about 94%, based on comparison to 12 finished CHORI-251 BAC clones. These clones collectively comprise a total of 1,265,617 bases of sequence. Table S11 shows detailed information of these comparisons. Specifically, ARACHNE covers 1,186,774 bases, or 93.8% of the clones, while PCAP covers 1,189,836 bases, or 94.0% of the clones.”
So I guess that this is where the 94% figure came from. It is a tiny sample size. I have to say, I find your 87% figure more convincing as a genome-wide estimate.
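A quick Python check of the BAC figures quoted from the supplement, plus how small the sample is. The genome size used for the last step is the 3.109 Gb scaffold span quoted earlier in this thread; treating that span as the genome size is an assumption made only to get an order of magnitude.

```python
# Figures quoted from page six of the supplement.
bac_total_bases = 1_265_617   # 12 finished CHORI-251 BAC clones
arachne_covered = 1_186_774   # bases covered by the ARACHNE assembly
pcap_covered = 1_189_836      # bases covered by the PCAP assembly

arachne_pct = 100 * arachne_covered / bac_total_bases
pcap_pct = 100 * pcap_covered / bac_total_bases

# Assumption: use the 3.109 Gb scaffold span as a stand-in for genome size.
genome_span_bases = 3.109e9
sample_pct_of_genome = 100 * bac_total_bases / genome_span_bases

print(f"ARACHNE coverage of BACs: {arachne_pct:.1f}%")
print(f"PCAP coverage of BACs:    {pcap_pct:.1f}%")
print(f"BAC sample as share of genome: {sample_pct_of_genome:.3f}%")
```

On these numbers the twelve clones amount to well under a tenth of a percent of the genome.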
It is also difficult to use chromosome walking (i.e. Sanger sequencing) in BAC clones because primers for repeat regions will bind all over the place instead of binding to a specific and unique sequence. Although I don’t have direct experience with total genome sequencing I do have extensive experience with PCR and sequencing of small stretches, so I think I have a handle on the difficulties.
I would also think that there is a lot of variation in the human population in regions with a lot of repeats for the very same reasons you have outlined. Would it be expected that the reported similarities between human genomes go down once these repeat regions are considered?
BAC clones should solve these problems as they are sequenced over time.
Chimpanzee sequence is much more similar to human sequence than other species as expected from common ancestry and evolution, not morphology. You could have very, very different genomes and still have nearly identical morphology. To use an analogy, the Google Chrome web browser looks almost identical on PC and Mac, but the underlying machine code is very different. The same applies to DNA. In fact, only a tiny, tiny portion of the overall genome has any effect on morphology, yet the whole genome is very similar.
Hi @glipsnort, I am very much hoping that you will have time to respond to my most recent post above. I am not making this comment to hurry you, but simply because I don’t want the topic to be automatically closed before you have a chance to respond.
Until now, apart from the 2005 Chimpanzee genome paper, I have mainly been basing my argument on simple analyses of the (more recent) human-chimp genome alignments provided by UCSC, made using the LASTZ software. I have no reason to think that these are highly inaccurate, but in case any of you had any doubts, I would draw your attention to a paper published this year that aligned the human and chimpanzee genomes using a different piece of software (called MUMmer).
The authors aligned PanTro4 with Hg38 using MUMmer. They write (emphasis added):
“MUMmer had 2.782 Gb of the sequence in mutual best alignments, where each location in the chimp was aligned to its best hit in human and vice versa, with an average identity of 98.07%. The 1.93% nucleotide-level divergence found here is higher than the 1.23% reported in the original chimpanzee genome paper. Our higher divergence is likely due to two factors: first, the 2005 report was based on 2.4 Gb of aligned sequence from older versions of both genomes, while ours is based on 2.782 Gb (16% more sequence) aligned between the current, more-complete versions of both genomes. Second, the original report used different methods, and may have counted fewer small indels than were counted in our alignments. Approximately 306 Mb (9.91%) of the human sequence did not align to the chimpanzee sequence, while 138 Mb (4.15%) of the chimpanzee sequence did not align to human. We detected 390 Mb in alignments where multiple sequences from chimpanzee aligned to the same location in human sequence and thus only one was chosen as the best alignment based on alignment identity.”
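The figures in that passage are internally consistent, as this short Python check shows. It uses only numbers from the quotation; nothing here comes from rerunning MUMmer.

```python
# Figures from the quoted passage.
mummer_aligned_gb = 2.782     # mutual best alignments, PanTro4 vs Hg38
aligned_2005_gb = 2.4         # aligned sequence in the 2005 report
mean_identity_pct = 98.07     # average identity within the MUMmer alignments

# Divergence is the complement of identity within the aligned sequence.
divergence_pct = 100 - mean_identity_pct

# Extra aligned sequence relative to the 2005 report.
extra_aligned_gb = mummer_aligned_gb - aligned_2005_gb
extra_aligned_pct = 100 * extra_aligned_gb / aligned_2005_gb

print(f"divergence: {divergence_pct:.2f}%")
print(f"extra aligned sequence: {extra_aligned_gb:.3f} Gb ({extra_aligned_pct:.0f}% more)")
```

Note that both the identity and the divergence apply only to the sequence that aligned; the unaligned 306 Mb and 138 Mb are outside that calculation.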
So if I am reading that correctly, the previous comparison had 2.4 Gb aligned, which was about 0.38 Gb less than the subsequent alignment. Previously, you seemed to indicate that DNA didn’t align because it lacked an ortholog in the human genome. However, they were able to align about 0.38 Gb of sequence that they were not able to previously, and the overall percentage was pretty much the same (1.23 v. 1.93%).
So how does this fit into your larger analysis of these genomes?
Hi @T_aquaticus, if you want to compare these results from MUMmer directly with my previous analyses, I would refer you to my post #40 above, where I give data for the PanTro4 chimp versus Hg38 human genome assemblies.
I have already discussed in some detail in posts above why the more recent alignments are giving greater length of alignment than the 2005 chimpanzee paper, and what we can and can’t infer from this.
Hi @DennisVenema thanks for raising this topic several weeks ago. This has been a very interesting discussion, and I am especially grateful to @glipsnort for helping me to understand the shortcomings of the 2005 paper and assembly, which I had not fully appreciated before.
I hope you have had a chance to engage with the data I have presented and perhaps re-evaluate your own understanding of the similarity of the human and chimpanzee genomes.
I notice that on page 32 of Adam and the Genome you wrote of humans and chimpanzees “our entire genomes are either around 95 per cent or 98 per cent identical depending on how one counts the effects of deletions of small blocks of DNA”.
As I think the discussion above shows, this claim is wrong. Our “entire genomes” have not been shown to be 95-98% identical to chimpanzee genomes. As my post #40 above shows: about 5% of the human genome has not yet been assembled, 4% of the human genome shows no alignment to the most recent chimpanzee assemblies, 5% is different due to copy number variation, over 1% is different due to SNPs. In fact, we can only be totally sure at present that 84% of the human genome is identical to the chimpanzee genome.
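The tally behind that last figure can be sketched in Python using the rounded percentages given in the paragraph above. The exact values are in post #40 and are not reproduced here, so the inputs below are approximations, and the SNP figure is entered as 1% even though the text says "over 1%".

```python
# Rounded percentages of the human genome, as listed above (approximate).
not_yet_assembled = 5.0
no_alignment_to_chimp = 4.0
copy_number_variation = 5.0
snp_differences = 1.0   # stated as "over 1%", so this is a floor

not_shown_identical = (
    not_yet_assembled
    + no_alignment_to_chimp
    + copy_number_variation
    + snp_differences
)
shown_identical = 100.0 - not_shown_identical

print(f"not shown to be identical: {not_shown_identical:.0f}%")
print(f"shown to be identical:     {shown_identical:.0f}%")
```

With these rounded inputs the remainder comes out at about 85%; the 84% quoted above presumably reflects the unrounded values.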
In response to your original request, I have freely admitted that the prediction that I made ten years ago about human-chimp similarity was wrong. I hope that you are also willing to admit that the claim that you made in Adam and the Genome on this topic is also wrong!