Human Chimp Genome Similarity

RichardBuggs · May 8, 2018, 9:51pm

Hi all,
I would be very grateful if someone was willing to move this discussion forward by making a case, based on current data, that the entire human genome is over 94% identical to the chimpanzee genome.
many thanks,
Richard

gbrooks9 · May 8, 2018, 10:16pm

What if someone said, it doesn’t matter what the exact percentage is?

glipsnort · May 8, 2018, 10:55pm

Since I know nothing about the quality or properties of the alignment being used, I have no idea what value could be extracted from it.

Bill_II · May 9, 2018, 1:06am

Silly question, but what is the value in coming up with a value? To a lay person once you get near 90% that is in the almost identical region.

For 2 randomly selected living humans what kind of similarity percentage could you expect?

RichardBuggs · May 9, 2018, 9:42am

Hi Steve,

See: Index of /goldenpath/hg38/vsPanTro6 and Index of /goldenpath/hg38/vsPanTro4

Here is the information available at the first URL above:

This directory contains alignments of the following assemblies:

target/reference: Human
(hg38, Dec. 2013 (GRCh38/hg38),
GRCh38 Genome Reference Consortium Human Reference 38 (GCA_000001405.15))
query: Chimp
(panTro6, Jan. 2018 (Clint_PTRv2/panTro6),
University of Washington)

Files included in this directory:

md5sum.txt: md5sum checksums for the files in this directory
hg38.panTro6.all.chain.gz: chained lastz alignments. The chain format is
described in Genome Browser Chain Format.
hg38.panTro6.net.gz: “net” file that describes rearrangements between
the species and the best Chimp match to any part of the
Human genome. The net format is described in
Genome Browser Net Format.
hg38.panTro6.net.axt.gz: chained and netted alignments,
i.e. the best chains in the Human genome, with gaps in the best
chains filled in by next-best chains where possible. The axt format is
described in Genome Browser axt Alignment Format.
hg38.panTro6.synNet.maf.gz - filtered net file for syntenic alignments
only, in MAF format, see also, description of MAF format:
Genome Browser FAQ
hg38.panTro6.syn.net.gz - filtered net file for syntenic alignments only
reciprocalBest/ directory, contains reciprocal-best netted chains
for hg38-panTro6

The hg38 and panTro6 assemblies were aligned by the lastz alignment
program, which is available from Webb Miller’s lab at Penn State
University (CCGB: Miller Lab). Any hg38 sequences larger
than 20,010,000 bases were split into chunks of 20,010,000 bases overlapping
by 10,000 bases for alignment. A similar process was followed for panTro6,
with chunks of 20,000,000 overlapping by 0. Following alignment, the
coordinates of the chunk alignments were corrected by the
blastz-normalizeLav script written by Scott Schwartz of Penn State.

The lastz scoring matrix (Q parameter) used was:

        A    C    G    T
  A     90 -330 -236 -356
  C   -330  100 -318 -236
  G   -236 -318  100 -330
  T   -356 -236 -330   90

with a gap open penalty of O=600 and a gap extension penalty of E=150.
The minimum score for an alignment to be kept was K=4500 for the first pass
and L=4500 for the second pass, which restricted the search space to the
regions between two alignments found in the first pass. The minimum
score for alignments to be interpolated between was H=2000. Other blastz
parameters specifically set for this species pair:
E=150
M=254
O=600
T=2
Y=15000
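
To make the scoring scheme above concrete, here is a minimal Python sketch that scores one already-aligned pair of sequences with the quoted substitution matrix and the O=600, E=150 penalties. The function name `score_alignment` and the toy sequences are made up for illustration. The sketch assumes a gap of length n costs O + n×E, which is how I understand lastz to charge gaps; it is not lastz's own code.

```python
# Substitution scores copied from the matrix quoted above (rows/columns A, C, G, T).
BASES = "ACGT"
MATRIX = [
    [90, -330, -236, -356],
    [-330, 100, -318, -236],
    [-236, -318, 100, -330],
    [-356, -236, -330, 90],
]
GAP_OPEN = 600    # O
GAP_EXTEND = 150  # E


def score_alignment(target: str, query: str) -> int:
    """Score two equal-length aligned strings; '-' marks a gap column.

    Assumption: a gap of length n costs GAP_OPEN + n * GAP_EXTEND.
    Simplification: adjacent gap columns count as one gap, whichever
    sequence carries the '-'.
    """
    if len(target) != len(query):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for t, q in zip(target.upper(), query.upper()):
        if t == "-" or q == "-":
            if not in_gap:
                score -= GAP_OPEN  # charged once per gap
                in_gap = True
            score -= GAP_EXTEND    # charged for every gapped column
        else:
            score += MATRIX[BASES.index(t)][BASES.index(q)]
            in_gap = False
    return score


print(score_alignment("ACGT", "ACGT"))  # four matches
print(score_alignment("ACGT", "AC-T"))  # one single-base gap
```

With matches worth 90 to 100 each, an ungapped run of perfect matches needs roughly 45 to 50 bases to reach the K=4500 threshold, and every mismatch or gap pushes that requirement up.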

The .lav format lastz output was translated to the .psl format with
lavToPsl, then chained by the axtChain program.

Chain minimum score: 5000, and linearGap matrix of (medium):
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300
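
As a rough illustration of how a table like this could be read, the sketch below interpolates a gap cost between the listed positions. It assumes piecewise-linear interpolation between neighbouring columns, which is what the name linearGap suggests to me; the helper name `gap_cost` is hypothetical, and the sketch deliberately refuses lengths outside the table rather than guess how axtChain extends it.

```python
from bisect import bisect_left

# Columns copied from the linearGap table quoted above.
POSITION = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
Q_GAP = [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900]
BOTH_GAP = [750, 825, 850, 1000, 1300, 3300, 23300, 58300, 118300, 218300, 318300]


def gap_cost(length: int, costs: list[int]) -> float:
    """Gap cost for `length`, interpolated between tabulated positions.

    Assumption: costs vary linearly between neighbouring table entries.
    """
    if length < POSITION[0] or length > POSITION[-1]:
        raise ValueError("length outside the tabulated range")
    i = bisect_left(POSITION, length)
    if POSITION[i] == length:
        return float(costs[i])  # exact table entry
    x0, x1 = POSITION[i - 1], POSITION[i]
    y0, y1 = costs[i - 1], costs[i]
    return y0 + (y1 - y0) * (length - x0) / (x1 - x0)


print(gap_cost(7, Q_GAP))     # between the entries for 3 and 11
print(gap_cost(7, BOTH_GAP))  # same length, gap on both sequences
```

Under that reading, the table makes long gaps much cheaper per base than short ones, which is what lets chains bridge large insertions and deletions.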

Chained alignments were processed into nets by the chainNet, netSyntenic,
and netClass programs.
Best-chain alignments in axt format were extracted by the netToAxt program.
All programs run after lastz were written by Jim Kent at UCSC.

References

Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence
alignments. Pac Symp Biocomput. 2002:115-26.

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D.
Evolution’s cauldron: Duplication, deletion, and rearrangement in the
mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep
30;100(20):11484-9.

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC,
Haussler D, Miller W. Human-Mouse Alignments with BLASTZ. Genome
Res. 2003 Jan;13(1):103-7.

RichardBuggs · May 10, 2018, 4:12pm

I think that many of you may have similar thoughts to @T_aquaticus in response to the estimates I am giving, so I will respond as best I can.

The 96.5% figure is from within the alignment of the two genomes. To get an alignment, you have to do a search for the similar parts of the two genomes. This means that in the alignment you will only get parts of the genome that are similar. There could be parts of the genome that have been accurately sequenced but genuinely differ by over, say, 25% (depending on exactly what threshold was set in making the alignments), and these will be left out of the alignment.

If we want to calculate the similarity of the entire genomes, we need to take into account the possibility of such regions of the genome. If these regions happen to be large, they will lead to very different estimates than those estimates that are only based on the sequences that have been aligned. My calculations based on the most recent alignments of the human and chimpanzee genomes show that about 4% of the assembled human genome (excluding gaps in the human assembly) does not currently have an alignment to the assembled chimpanzee genome.
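
The difference between the two denominators can be shown with a toy example. This is only a sketch: the function name `identity_stats`, the ten-base alignment and the count of unaligned bases are all invented, and counting unaligned bases as non-identical is just one possible convention (it is the point under dispute in this thread).

```python
def identity_stats(target_aln: str, query_aln: str, unaligned_target_bases: int) -> dict:
    """Percent identity under two denominators for one toy alignment.

    target_aln / query_aln: equal-length aligned strings, '-' marks a gap.
    unaligned_target_bases: target bases with no alignment at all
    (hypothetical input for this illustration).
    """
    matches = mismatches = 0
    for t, q in zip(target_aln.upper(), query_aln.upper()):
        if t == "-" or q == "-":
            continue  # gap columns are neither matches nor mismatches here
        if t == q:
            matches += 1
        else:
            mismatches += 1
    aligned_cols = matches + mismatches
    target_bases = sum(1 for t in target_aln if t != "-") + unaligned_target_bases
    if aligned_cols == 0 or target_bases == 0:
        raise ValueError("nothing to compare")
    return {
        # What a within-alignment figure measures.
        "identity_in_aligned_columns": matches / aligned_cols,
        # Counts every target base; unaligned bases count as non-identical.
        "identity_over_all_target_bases": matches / target_bases,
    }


stats = identity_stats("ACGTACGTAC", "ACGTTCGTAC", unaligned_target_bases=10)
print(stats)
```

In this toy case the same nine matching bases give 90% identity within the alignment but 45% once the ten unaligned bases are put in the denominator.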

This figure could go down as the chimpanzee assembly is improved, as has been pointed out several times in this discussion. That is true. But the last time there was a substantial improvement in the chimpanzee assembly (from PanTro4 to PanTro5), whilst 230,249,387 sequenced bases were added to the chimpanzee genome, the one-to-one orthology between the human and chimpanzee genomes only grew by 59,439,415 bases. I think this is probably to be expected, because the parts of genomes that are hard to assemble tend to be the most repetitive parts, which also tend to be the fastest evolving. Thus I do not expect the whole of the human genome to show alignment once the chimpanzee genome is complete. For similar reasons, I think it is unlikely that the 5% of the human genome assembly that is currently made up of gaps will all be alignable to the chimpanzee genome (assuming sensible alignment thresholds).
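
For anyone who wants to check the proportion, the two base counts quoted above can be divided directly. The variable names are mine; the numbers are the ones in the paragraph.

```python
# Base counts quoted above for the PanTro4 -> PanTro5 comparison.
added_chimp_bases = 230_249_387     # sequenced bases added to the chimp assembly
new_one_to_one_bases = 59_439_415   # growth in one-to-one orthology with human

fraction = new_one_to_one_bases / added_chimp_bases
print(f"{fraction:.1%}")  # -> 25.8%
```

So roughly a quarter of the newly added chimpanzee sequence produced new one-to-one alignment with the human genome.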

However, note that in my “maximum” estimate figure in Post 40 above (i.e. 93.43% similarity between humans and chimps) I have made the unrealistic assumption that all of the regions of the human genome assembly that currently have no alignment to the chimp assembly, and all the regions that are currently gaps, are made up of sequences that have true one-to-one orthology with the chimpanzee genome.

Also, to come back to your 96.5% figure: if I understand correctly where you are getting this figure from, it does not include copy number variants. The recent comparisons I have done of the latest alignments between human and chimpanzee genome assemblies suggest that about 5% of the human genome may be copy number variants of sequences in the chimpanzee genome. This figure may change as both genomes get better resolved, but it actually increased the last time there was a major improvement in the chimpanzee genome assembly, from 4.86% with PanTro4 to 5.03% with PanTro6.

Thus, I don’t think it is unreasonable to expect that once unaligned regions and copy number variants are included in a percentage similarity figure (and they have to be if we are to get a figure that summarises the entire genomes), the result will be quite different from 96.5%.

One day we will know the figure for sure. At the moment I posit that we may know enough to make a well-informed prediction.

AMWolfe · May 10, 2018, 4:38pm

I think it’s only fair to comment that I think this is a well-written, clear explanation. I have at times rankled at some of your comments (particularly your more tendentious summaries of Dennis’s positions), but this comment had none of that edge to it. To the contrary, I feel like I learned something, and if there are inaccuracies here, I’d find it helpful to understand them.

Now, at the same time, I come back to the question that has been posed a few times so far, which is, what exactly is the point behind haggling over a few percentage points here and there, except perhaps to cast aspersions on the common descent of humans and chimpanzees or to suggest some great scientific conspiracy that’s unduly driving the similarity numbers higher.

I am curious about this sort of meta-level question of “Why are we even asking this question?”. But as for the explanation itself, I found it lucid. Thanks for posting it!

Bill_II · May 10, 2018, 5:01pm

And if I might add, What do we gain by answering the question? As far as I can tell not much of anything.

AMWolfe · May 10, 2018, 5:07pm

Well, yes and no (imho). I mean, science is built on people repeating the experiments of others and validating their conclusions (or not, indeed!). One of the valid critiques from within the scientific community of the scientific community these days is that repeating others’ experiments isn’t seen as “sexy” enough for CVs and so it never gets done or published. I think there is value in double-checking science – any scientific findings – for precision. Now, in the wider context with the pattern of comments here on the BioLogos Forum, and placed within a sociological setting of science skepticism, it does make one wonder what the real point is. But I think the double-checking itself could be valuable. (Just my very uninformed opinion.)

T_aquaticus · May 10, 2018, 8:20pm

NOTE: I am not an expert on genome sequencing, so you or others should feel free to correct any errors that I make. I am hoping @glipsnort will smack me upside the head and point out my mistakes.

From my limited understanding you have two things: the assembly and the alignment.

The genome assembly is the result of taking the small portions that were sequenced and connecting them into a longer sequence. No assembly is complete, so there will be gaps where it wasn’t clear which short sequences fit into that gap. This could be due to bad sequencing runs or an inability to determine where a specific short sequence fits into the larger sequence.

The alignment takes the human and chimp assembled sequences and puts orthologous sections side by side. Since there are gaps in each assembly there will be no sequence in one assembly where there is good sequence in the other genome. This doesn’t necessarily mean that there is a gap (i.e. indel) in one assembly. It simply means that they don’t know what the sequence of the DNA is in that region.

You seem to count these gaps in the assembly as missing DNA, and then count it as 0% similarity. That seems wrong to me. It isn’t a gap in the genome. It is a gap in the assembly. It is wrong to assume that the chimp and human genomes are 0% similar where there is a lack of sequence information. As those gaps are filled in I would strongly suspect that they would have the same average similarity as the other sections of the genomes where we do have sequence in both assemblies.

Isn’t it entirely possible that the 4% of the human genome that does not currently have an ortholog in the chimp genome is simply DNA sequence that has yet to be added to the chimp genome assembly?
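
To see how much the treatment of the unaligned portion matters, here is a small sketch of the weighted average involved. The function name and the inputs are illustrative only: 0.96 and 0.965 are round placeholders based on the "about 4% unaligned" and "96.5%" figures mentioned in this thread, and they are not Richard's Post 40 numbers, which also account for assembly gaps and copy number variants.

```python
def whole_genome_identity(aligned_fraction: float,
                          identity_in_aligned: float,
                          identity_in_unaligned: float) -> float:
    """Weighted average of identity over aligned and unaligned portions."""
    if not 0.0 <= aligned_fraction <= 1.0:
        raise ValueError("aligned_fraction must be between 0 and 1")
    return (aligned_fraction * identity_in_aligned
            + (1.0 - aligned_fraction) * identity_in_unaligned)


ALIGNED_FRACTION = 0.96     # placeholder: about 4% currently unaligned
IDENTITY_ALIGNED = 0.965    # placeholder: the within-alignment figure

# Scenario 1: unaligned sequence counted as 0% similar.
as_zero = whole_genome_identity(ALIGNED_FRACTION, IDENTITY_ALIGNED, 0.0)
# Scenario 2: unaligned sequence turns out as similar as the aligned average.
as_average = whole_genome_identity(ALIGNED_FRACTION, IDENTITY_ALIGNED, IDENTITY_ALIGNED)

print(f"{as_zero:.2%} to {as_average:.2%}")
```

With these placeholder inputs the two scenarios bracket the answer between about 92.6% and 96.5%, so the disagreement is really about where in that range the unaligned sequence will land.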

Also, what happens when you use the same technique to compare the chimp and gorilla genomes, or the chimp and orangutan genomes? What about comparing two human genomes? After all, the prediction made by the theory of evolution is that we should see a phylogeny.

T_aquaticus · May 10, 2018, 8:25pm

Pedantry is one of the highest art forms in the sciences.

The percentage difference would only matter if it were so large that the known mutation rate, generation time, and time since divergence could not produce those differences. From what I have read, this isn’t a problem.

The larger effort from the ID/creationist side seems to be obfuscation. The plan is to create as much doubt as possible, even if the possible errors they are pointing to are completely irrelevant to the larger picture.

AMWolfe · May 10, 2018, 8:38pm

This is, of course, how it appears to most of us. But I keep hoping that perhaps Richard will show us his cards and answer these sorts of questions himself, rather than content himself with letting us come up with our own explanations (which are often none too flattering). For better or worse, though, that doesn’t seem to be his style… unless I’m missing something (which is of course altogether possible given the speed at which I have skimmed much of these exchanges)…

GJDS · May 11, 2018, 1:11am

I am inclined to ask the same question of the entire “human genome and Adam” controversy that has taken so much comment here - what is the point of experimental data to the teachings of the Christian faith that God created the first true human couple? The only response that appears comprehensible to me at least (and I am certain I am in the minority??) is to indulge in YEC and ID “bashing” (as we say down under).

GJDS · May 11, 2018, 1:38am

I suppose the other side of the coin is claiming certainty when there is none. The entire argument rests on an assumption that a given difference in the comparison can lead, with certainty, to a given conclusion. This is either quantifiable and generally true for all species, or (within your cosy tent of certainty) it is an assumption, and the need for such an assumption should be discussed and questioned by scientists - after all, that is the scientific method.

pevaquark · May 11, 2018, 1:38am

I don’t quite follow what you mean here, GJDS. I will add that I believe many are wondering what @RichardBuggs is ultimately aiming to do, as his work seems to fit into the typical anti-evolution playbook. As in:

They cite areas of uncertainty or controversy, no matter how minor, within the body of research that invalidates their desired course of action
They categorize the overall scientific status of that body of research as uncertain and controversial
They advocate proceeding as if the research did not exist

Now of course, Dr. Buggs is not doing this as he is working quite hard to analyze such similarities. But his arguments are being used by some who completely reject the growing body of scientific research that falls under the umbrella of evolution.

GJDS · May 11, 2018, 1:46am

All of my comments are within the context of the human genome as grounds for redefining, or revising, the faith-based understanding that God created Adam and Eve - thus, if some feel the need to change Christian theology based on biology, they must imo bring hard evidence that is directly relevant to the biblical teaching. So far, I have seen (from what I can find time to read) arguments that fail the criteria - this site should not be here to defend ToE; that is for those biologists who have built a career on that. These arguments seem, inevitably, to come down to YEC/TE/ID groupings, and their disagreements - so far, this to me is irrelevant to the topic discussed.

RichardBuggs · May 11, 2018, 4:42pm

Hi @T_aquaticus

It is possible, but as I have argued above, it is unlikely.

The two reasons I have for thinking that the majority of this 4% in reality does not have one-to-one orthology with the chimpanzee genome are:

1. The parts of the genome that have not yet been assembled are very likely to be regions that are highly repetitive. They are by nature hard to assemble. Such regions also tend to be fast evolving.
2. Last time a lot of new assembled DNA sequence was added to the chimpanzee assembly, only about a quarter of it provided new one-to-one orthologs with the human genome.

Note, though, that even if we assume that all of the 4% of currently unaligned human sequence really does have a one-to-one ortholog in the chimpanzee, with zero differences (and that all of the currently unassembled parts of the human genome also have one-to-one orthology with the chimpanzee), we still get an estimate of 93.4% similarity between the human and chimpanzee genomes, which is rather less than the 96.5% figure that you mentioned earlier.

I hope that makes sense.

T_aquaticus · May 11, 2018, 9:28pm

I would agree that repetitive DNA would be more susceptible to indels through genetic recombination, but I also think #1 explains #2. Repetitive sections are hard to sequence, and this would be true in both the human and chimp genomes. Therefore, there is a very good chance that newly added chimp sequence will align with regions that have yet to be assembled in the human genome.

I would also be curious what your results would be when comparing the chimp genome to the genomes of other apes, and also the percentage similarity when comparing two human genomes.

MarkD · May 12, 2018, 4:04pm

It certainly seems obvious on the face of it that humans are more closely related to chimps than to any other living animal based on obvious morphological similarities alone. That the close relationship should be found in the DNA as well should not be surprising.

No one should have to defend the position that humans aren’t animals or that we haven’t been evolving just as every other living life form has over the eons. Accepting this should not be a bridge too far for a Christian, and if it is then clearly the theology needs adjusting.

Perhaps the solution is to attribute natural selection to God’s plan. Or perhaps the concept of God itself could be dialed down from that of creator of everything whatsoever to something inside ourselves which makes our oh-so-different human experience possible.

RichardBuggs · May 12, 2018, 5:58pm

Hi T, yes, repetitive regions are more likely to have mutations due to unequal crossovers during recombination. The main reason why they are hard to assemble is that when a particular sequence is repeated many times in tandem (one after another after another…) it is hard to know how many times that repeat occurs. We don’t know which copy of the repeat our DNA reads come from. It is a bit like the pieces of a jigsaw puzzle (to use the analogy you used earlier) that are all from a blue sky. They are all exactly the same so it is very hard to place them, and unlike a jigsaw puzzle, they don’t have unique shapes that allow us to place them uniquely in the end. The only way to properly assemble them is to use very long reads (such as PacBio, which was used in PanTro5 and PanTro6) or methods that pair up two reads over long distances (mate pair libraries), but even today our methods are not always good enough. Obviously a lot of effort has gone into this for the human genome, and somewhat less for the chimpanzee, though what has been done on the chimpanzee is pretty impressive. A lot of repetitive DNA has been assembled in both genomes.

It is worth noting, however, that it is often the number of copies of a repeat that we don’t know, not the sequence of the repeat itself. So if, for example, we are still unsure of how many times a particular repeat occurs in the chimpanzee genome, we may still know the sequence of the repeat and be able to align that to the human genome. In the .net.axt alignments (see post 35 above) the single sequence we have from the chimpanzee would be able to align multiple times to the human genome, so in my analyses in post 35 and post 40, these would show up as “Paralogs in axtnet alignment”, not as “Unaligned”.
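
The blue-sky jigsaw problem described above can be shown with a toy example. This is only a sketch under simplifying assumptions: the flanking sequences and the `CA` repeat unit are invented, reads are idealised as error-free substrings of a fixed length, and only the presence of each read is compared (real assemblers can also use read depth as a hint to copy number).

```python
def kmers(sequence: str, k: int) -> set[str]:
    """Set of all substrings of length k, standing in for error-free reads."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Invented toy sequences: unique flanks around a tandem repeat of "CA".
LEFT = "GTTGGTGTTTGGTTGTGGGT"
RIGHT = "TGGTTTGTGGTGTTGGGTTG"
UNIT = "CA"

five_copies = LEFT + UNIT * 5 + RIGHT   # repeat array of 10 bases
eight_copies = LEFT + UNIT * 8 + RIGHT  # repeat array of 16 bases

# Reads shorter than the repeat array: both genomes yield the same reads.
short_reads_agree = kmers(five_copies, 8) == kmers(eight_copies, 8)
# Reads long enough to span the 5-copy array with flank on both sides.
long_reads_agree = kmers(five_copies, 20) == kmers(eight_copies, 20)

print(short_reads_agree, long_reads_agree)  # -> True False
```

With 8-base reads the two genomes are indistinguishable, so the copy number cannot be recovered from which reads are present. With 20-base reads, some reads contain the whole 5-copy array plus both flanks, and that is what pins down the count. The repeat sequence itself is known in both cases, which is why it can still be aligned.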

Another reason why a region of the genome may be hard to assemble is that it is highly heterozygous in the individual being sequenced, or for some other reason it is highly variable among allelic copies in the material being sequenced. Such regions are likely to be rapidly evolving, or under long-term balancing selection.

For these reasons, (and the point I have made previously about the comparison of PanTro4 and PanTro5 alignments with Hg38) I would not be too optimistic that newly added chimp sequence will align with regions that have yet to be assembled in the human genome.

I haven’t looked at that, but looking at UCSC alignments of other species to the human genome, the chimpanzee has a much higher percent identity than any of the other species, as one would expect from morphology (as @MarkD has noted). The overall percentage of human vs other species genomes that are identical is in the 80s for chimpanzee and bonobo, in the 70s for the macaque, in the 40s for tarsier, in the 30s for cat, dog and cow, in the 20s for mouse and rat.