Human Chimp Genome Similarity

RichardBuggs · June 27, 2018, 7:06am

Dear all,

Here is a final summary of my position.

“How similar are the human and chimpanzee genomes?” is a relatively straightforward scientific question. We are hindered by the still somewhat incomplete nature of both the human and the chimpanzee reference genome assemblies, but we can make this clear in our assessments and allow for the uncertainties that it raises.

The best way to assess the similarity of two genomes is to take complete genome assemblies of both species, that have been assembled independently, and align them together. The alignment process involves searching the contents of the two genomes against each other. Parts of both genomes that are too different to match one another will be absent from the alignment, unless they are very short, in which case they will be included as “indels” (longer indels, even if they have well characterised flanking sequences, will be absent from the alignment). Within parts that do align, there will be some mismatches between the two genomes, where one or a few nucleotides differ, which in this discussion we have been calling “SNPs”. In addition there will be some parts of each genome that are present twice or multiple times in one genome and are present fewer times in the other genome. We have referred to these as “paralogs” or “copy number variants” (CNVs). To come up with an accurate figure of the similarity of the entirety of two genomes, we need to take into account all these types of difference.
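As a toy illustration of the within-alignment categories described above (not the LASTZ pipeline, and with made-up sequences), the sketch below counts matching columns, mismatches ("SNPs") and gap columns (short indels) in one small gapped pairwise alignment. The function name `classify_columns` and the two example strings are hypothetical. Unaligned regions and CNVs never appear inside such an alignment, so they would have to be counted separately.

```python
from collections import Counter


def classify_columns(human_aln: str, chimp_aln: str) -> Counter:
    """Count match, mismatch and indel columns in a gapped pairwise alignment."""
    if len(human_aln) != len(chimp_aln):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for h, c in zip(human_aln.upper(), chimp_aln.upper()):
        if h == "-" and c == "-":
            continue  # a column that is a gap in both rows carries no information
        if h == "-" or c == "-":
            counts["indel"] += 1      # base present in one genome only
        elif h == c:
            counts["match"] += 1      # identical nucleotide in both genomes
        else:
            counts["mismatch"] += 1   # single-nucleotide difference ("SNP")
    return counts


# Hypothetical toy alignment; '-' marks a gap.
human = "ACGTAC-GTTACG"
chimp = "ACGTTCAGT--CG"
counts = classify_columns(human, chimp)

# Exact matches as a share of the human bases in the alignment.
human_bases = sum(1 for h in human if h != "-")
exact_match_pct = 100 * counts["match"] / human_bases
print(dict(counts), f"{exact_match_pct:.1f}% of human bases match exactly")
```

The denominator here is the number of human bases, which mirrors the way the figures in this summary are expressed as percentages of the human genome.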

For some purposes, when talking about the similarity between two genomes we may want to just focus on one type of difference, such as SNPs. If we do this, we should always specify which types of difference we have and have not taken into account. The most well-known estimates for the similarity of the human and chimpanzee genomes only take into account SNPs and small indels. Copy number variants are less often included, and regions of the two genomes that do not align are commonly ignored.

When assessing the total similarity of the human genome to the chimp genome, we also need to bear in mind that roughly 5% of the human genome has not been fully assembled yet, so the best we can do for that 5% is predict how similar it will be to the chimpanzee genome. We do not yet know for sure. The chimpanzee genome assembly is less well assembled, so in future we may assemble parts of the chimpanzee genome that are similar to the human genome - this is another source of uncertainty to keep in mind.

To come up with the most accurate current assessment that I could of the similarity of the human and chimpanzee genomes, I downloaded from the UCSC genomics website the latest alignments (made using the LASTZ software) between the human and chimpanzee genome assemblies, hg38 and PanTro6. See post #35 above for details. This gave the following for the human genome:

4.06% had no alignment to the chimp assembly
5.18% was in CNVs relative to chimp
1.12% differed due to SNPs in the one-to-one best aligned regions
0.28% differed due to indels within the one-to-one best aligned regions

The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 84.38%.
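A quick arithmetic check of the figures above, as a sketch: the four difference categories and the exact-match figure do not sum to 100%. The remainder comes to roughly 5%, which is in line with the roughly 5% of the human genome described earlier as not yet fully assembled. Reading the remainder that way is an inference from the numbers, not something the list itself states; the dictionary keys are labels chosen here for illustration.

```python
# Percentages of the human genome (hg38 vs PanTro6), copied from the list above.
differences = {
    "no_alignment": 4.06,
    "cnv": 5.18,
    "snp": 1.12,
    "indel": 0.28,
}
exact_match = 84.38

total_differences = sum(differences.values())
accounted_for = total_differences + exact_match
remainder = 100.0 - accounted_for  # assumed here to be the unassembled part of hg38

print(f"differences:   {total_differences:.2f}%")
print(f"accounted for: {accounted_for:.2f}%")
print(f"remainder:     {remainder:.2f}%")
```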

In order to assess how improvements in genome assemblies can change these figures, I did the same analyses on the alignment of the older PanTro4 assembly against hg38 (see post #40 above). The PanTro4 assembly was based on a much smaller amount of sequencing than the PanTro6 assembly (see post #39 above). In this PanTro4 alignment:

6.29% had no alignment to the chimp assembly
5.01% was in CNVs relative to chimp
1.11% differed due to SNPs in the one-to-one best aligned regions
0.28% differed due to indels within the one-to-one best aligned regions

The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 82.34%.

Thus the large improvement in the chimpanzee genome assembly between PanTro4 and PanTro6 has led to an increase in CNVs detected, and a decrease in the non-aligning regions. It has only increased the one-to-one exact matches from 82.34% to 84.38% even though the chimpanzee genome assembly is at least 8% more complete (I think) in PanTro6.
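The change between the two assemblies can be tabulated directly from the two sets of figures. This is only a sketch that subtracts the PanTro4 percentages from the PanTro6 ones; the dictionary keys are labels chosen here for illustration.

```python
# Percentages of the human genome (hg38), copied from the two lists above.
pantro4 = {"no_alignment": 6.29, "cnv": 5.01, "snp": 1.11,
           "indel": 0.28, "exact_match": 82.34}
pantro6 = {"no_alignment": 4.06, "cnv": 5.18, "snp": 1.12,
           "indel": 0.28, "exact_match": 84.38}

# Change in percentage points when moving from PanTro4 to PanTro6.
delta = {key: round(pantro6[key] - pantro4[key], 2) for key in pantro4}

for key, change in delta.items():
    print(f"{key:>13}: {pantro4[key]:6.2f} -> {pantro6[key]:6.2f} ({change:+.2f} points)")
```

On these numbers, the 2.23 points that leave the non-aligning category reappear almost entirely as exact matches (2.04 points), with a small share moving into CNVs (0.17 points).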

The PanTro4 assembly has also been aligned to the human genome using the software MUMmer4 (reported in: Marçais, Guillaume, et al. “MUMmer4: A fast and versatile genome alignment system.” PLoS computational biology 14.1 (2018): e1005944). This method gives broadly similar figures to my analyses of the UCSC LASTZ alignments. MUMmer places 2.782 Gb of the sequence in mutual best alignments, and the total length of the LASTZ alignment is 2.761 Gb. In the MUMmer analysis approximately 306 Mb (9.91%) of the human sequence did not align to the chimpanzee sequence in mutual best alignments. This fits well with the LASTZ result of 6.29% non-aligning plus 5.01% CNV = 11.30% not aligning. Overall, the MUMmer software has been slightly more generous in aligning the human and chimp genomes, but as @glipsnort has pointed out, MUMmer is giving a higher estimate of SNP differences within its alignments. This is probably a signal that it has over-aligned the two genomes and some of its alignments are spurious. Thus I think we are best off trusting the LASTZ alignment over the MUMmer alignment, though the difference between the results of the two methods is rather small.
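The MUMmer and LASTZ figures quoted above can be cross-checked with a few lines of arithmetic. This sketch assumes that the 9.91% is calculated against the sum of the aligned and non-aligned MUMmer sequence (2.782 Gb + 306 Mb); that denominator is an assumption, though it does reproduce the quoted percentage.

```python
# Figures quoted above for the PanTro4 alignments to the human genome.
mummer_aligned_gb = 2.782    # in mutual best alignments (MUMmer)
mummer_unaligned_gb = 0.306  # 306 Mb not in mutual best alignments (MUMmer)
lastz_aligned_gb = 2.761     # total length of the LASTZ alignment

# Assumption: the percentage is taken over aligned + non-aligned sequence.
mummer_total_gb = mummer_aligned_gb + mummer_unaligned_gb
mummer_unaligned_pct = 100 * mummer_unaligned_gb / mummer_total_gb

# LASTZ: non-aligning plus CNV, as percentages of the human genome.
lastz_not_one_to_one_pct = 6.29 + 5.01

# Extra sequence that MUMmer places in alignments compared with LASTZ.
extra_aligned_mb = (mummer_aligned_gb - lastz_aligned_gb) * 1000

print(f"MUMmer not aligned:  {mummer_unaligned_pct:.2f}%")
print(f"LASTZ not 1-to-1:    {lastz_not_one_to_one_pct:.2f}%")
print(f"MUMmer extra length: {extra_aligned_mb:.0f} Mb")
```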

As 5% of the human genome is still unassembled, and 5% seems to be CNVs relative to chimp, and 4% is unaligned to the chimp genome, I cannot agree with @DennisVenema and @glipsnort that “95% is the best estimate we have for the genome-wide identity of chimps and humans”. I would accept 95% as a prediction, but not as a statement of established fact.

I predict that the 95% figure will prove to be wrong, because (on the basis of my comparison of the PanTro4 and PanTro6 alignments to hg38) I think that the CNV differences are here to stay, and I doubt that all of the currently unaligned or unsequenced regions of the human genome will prove to be 95% the same as the chimpanzee genome. Some of the “unaligned” human sequences are medium-sized indels, and it is hard to see why they would not have been assembled in the chimp if they were present. I also expect at least some of these unaligned or unsequenced sequences to be rapidly evolving.

In 2008 I wrote “I predict that when we have a reliable, complete chimpanzee genome, the overall similarity of the human genome will prove to be close to 70% (and very far from 99%).” This prediction is not borne out by the more recent data above. I made a mistake in my 2008 calculations in the way in which I dealt with CNVs, which put me out by 2.7%, but this was only a minor component of why my estimate was so low. The main reason why my estimate was so low was because I thought that the 2005 chimpanzee genome assembly was far more complete than it actually was. This was because the authors claimed in the main text of the chimpanzee paper “the draft genome assembly…covers ~94% of the chimpanzee genome with >98% of the sequence in high-quality bases.” Thanks to discussion in this thread with @glipsnort (see post #62, #63 and others above), who was one of the authors of the 2005 chimp genome paper, I can now see that the 2005 draft genome assembly was not as good as this claim suggested. However, in 2008 I did not know this, and my prediction was made in good faith on the basis of my understanding of the 2005 paper.

Thank you all for an interesting discussion. Please accept this as my final summing up and closing statement.

Best regards,
Richard

T.j_Runyon · June 27, 2018, 8:20am

Thank you for taking the time to participate. I think @DennisVenema and @glipsnort should provide final statements as well, if they have time to do so.

Jay313 · June 27, 2018, 11:33am

Thank you for your time and efforts. Please feel free to come back and contribute your perspective on other topics, as well. I would be interested to hear your thoughts on something other than your specialty. Of course, I’m assuming that you do think about things besides genetics once in a while …

Bill_II · June 27, 2018, 2:04pm

I have a feeling that if the 95% does prove out Dr. Buggs would still argue that it doesn’t indicate common ancestry.

T_aquaticus · June 27, 2018, 4:16pm

Neither the chimp nor the human genome is completely assembled, so that would be the first major problem. There are gaps in each assembly, and those gaps will be in different places in each one. This means that a lack of a match between the genome assemblies could simply be a gap in one of the assemblies.

It would be interesting to see the results for the same comparison of the chimp and gorilla genomes. It is the pattern of similarity that evidences common ancestry and evolution, not a set percentage. If there are more differences between the chimp and gorilla genomes, then what?

gbrooks9 · June 27, 2018, 7:00pm

RichardBuggs:

The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 82.34%.

o o o

As 5% of the human genome is still unassembled, and 5% seems to be CNVs relative to chimp, and 4% is unaligned to the chimp genome, I cannot agree with @DennisVenema and @glipsnort that “95% is the best estimate we have for the genome-wide identity of chimps and humans”. I would accept 95% as a prediction, but not as a statement of established fact.

o o o

In 2008 I wrote “I predict that when we have a reliable, complete chimpanzee genome, the overall similarity of the human genome will prove to be close to 70% (and very far from 99%).” This prediction is not borne out by the more recent data above. I made a mistake in my 2008 calculations in the way in which I dealt with CNVs, which put me out by 2.7%, but this was only a minor component of why my estimate was so low. The main reason why my estimate was so low was because I thought that the 2005 chimpanzee genome assembly was far more complete than it actually was. . . . in 2008 I did not know this, and my prediction was made in good faith on the basis of my understanding of the 2005 paper.

@RichardBuggs

You probably know that on the topic of Human/Chimp genome comparison, I’m not exactly an “easy sell”.

But your “Final Summary” is impressive in its conciseness, its frankness and even in its elegant simplification of highly complex data.

I celebrate your most important sentence!:

“The percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 82.34%.”

Intuitively speaking, this seems to be the best way most of us can tune into this issue:

by comparing “one-to-one exact matches in the genome”. This might not be the best approach for all purposes, but it is certainly an excellent benchmark to start with. If anyone reads anything in this thread, it should be your Final Summary!

Best wishes, George Brooks

WilliamDJ · July 5, 2018, 8:29am

Good to know this.

Also good to know, when comparing program P1 ‘build a chimp’ and program P2 ‘build a human’, is whether P1 and P2 differ only in their parameters or also in their dimensions. If the difference is only in the parameters, P1 can change into P2 by random variation of its parameters and selection. If the difference is also in the dimensions, billions of variations of the parameters during billions of years cannot produce a change in dimensions (= second order change / transformation / innovation). The progress of science in the ENCODE project (see: https://www.encodeproject.org/) will eventually reveal the type of differences between P1 and P2. Notice that change of P1 and P2 in their dimensions is antagonized by mutation repair systems. (see: Can mutations produce mutation repair systems? - #74 )

Dr. William DeJong
(Evoskepsis)

glipsnort · July 5, 2018, 11:54am

This might be a reasonable question if genomes were programs or had parameters and dimensions.

Bill_II · July 5, 2018, 2:19pm

But WilliamDJ says it does so that must mean it is true.

Comparing DNA to a program is not a bad analogy but at a fundamental level it breaks down.

glipsnort · July 5, 2018, 2:34pm

It’s an okay analogy for some purposes, terrible for others. (And programs don’t have dimensions either.)

gbrooks9 · July 5, 2018, 3:35pm

@WilliamDJ

Are you hijacking the thread with the mutation repair Red Herring? If there are still mutations despite mutation repair systems, and if mutation repair systems are genetically established, I really can’t see the point of bringing back that old saw horse to a topic that won’t benefit a bit by its introduction!

Bill_II · July 5, 2018, 5:38pm

I wrote many a FORTRAN program that was just full of dimensions.

aarceng · July 16, 2018, 8:45am

My understanding is that the Chimpanzee genome was assembled using the human genome as a framework. Is this correct and is it still the case?

martin_r · September 5, 2018, 7:41am

human genome was never completely sequenced, (9% is still missing), neither was chimpanzee nor any other mammalian genome…

So how can you compare incomplete data?

2017 article:

martin_r · September 5, 2018, 7:45am

human vs. chimpanzee

look at this video lecture made by secular scientists, very clear, very simple to understand, show it to your kids…

here is the most important idea from the lecture (at 1:27)

“yes, we share 99% of our DNA with chimps, if we ignore 18% of their genome and 25% of ours”

Bill_II · September 5, 2018, 12:32pm

@martin_r You might try reading this thread from the beginning. This horse is well and truly dead.

martin_r · September 5, 2018, 1:09pm

Bill… i will…

tell me Bill, the video lecture i have posted here is wrong?

especially this part: “yes, we share 99% of our DNA with chimps, if we ignore 18% of their genome and 25% of ours”

Is it false? before i read the whole forum from the start…

Chris_Falter · September 5, 2018, 1:39pm

Let’s do a thought experiment. You find 2 books that have different titles, and you start reading each of them. When you’re about 80-90% done, you compare passages between the 2 books. By working carefully, you are able to find paragraphs and sometimes even entire chapters that are 97% word-for-word identical.

Even though you did not read 100% of the books, would you be able to draw a conclusion about whether they have a common origin? In the book metaphor, of course, the common origin could have several explanations (e.g., common author, 2 different editions of the same book, one author plagiarized another, etc.).

Hope that helps.

Chris

Bill_II · September 5, 2018, 1:48pm

@martin_r First, YouTube videos are not the best source of information. Given the numbers quoted in this thread, which come from recent peer-reviewed papers, I can say, as a layman, that the numbers you quoted are very outdated.

Read the entire thread and see what you think.

martin_r · September 5, 2018, 2:14pm

Chris, don’t post these kinds of thought experiments… i posted here a video lecture made by secular scientists, and their main message is:

"yes, we share 99% of our DNA with chimps, if we ignore 18% of their genome and 25% of ours”

So where is your “97% word-for-word identical.” or are these guys wrong and you are right? I am a layman, i don’t know who to trust…

Have you watched the video?