Human Chimp Genome Similarity

Hi all,

I have done some more calculations to try to come up with both an upper bound and a lower bound on the similarity between the human and chimpanzee genomes. These are based on the most recent alignments between the human and chimpanzee genomes available at the UCSC genomics website (hg38 versus PanTro6); see Index of /goldenpath/hg38/vsPanTro6

By my calculations, the human genome has between 84.4% and 93.4% one-to-one orthology with the chimpanzee genome. The uncertainty makes allowance for possible current incompleteness in our knowledge of both the human and chimpanzee genomes. The upper bound of 93.4% assumes that all the regions of the human and chimpanzee genomes that we have not yet assembled and/or aligned will prove to have one-to-one orthology between humans and chimps. The lower bound of 84.4% assumes that these regions will prove to be different between humans and chimpanzees. I have assumed throughout that further sequencing is unlikely to significantly alter currently known differences due to SNPs, indels, and copy number variation.

Here is how I did the calculation. I downloaded both the reciprocal best (“rbest”) alignment and the “net” alignment (which allows copy number variation) from UCSC for hg38 and PanTro6. Using custom Perl scripts, I measured the length of each alignment, and the number of SNPs and insertions in each one. I looked up the “Total assembly gap length” in the Hg38 human genome assembly statistics online (only a negligible number of these gap bases were present as Ns in the alignments). Gap lengths are known only rather approximately, and many are estimated to the nearest 1000 or 10,000 bases.
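For anyone who wants to repeat this kind of count, here is a minimal Python sketch of tallying aligned columns, mismatches and gaps from an axt file. It is not the Perl script described above, just an illustration under stated assumptions: each axt block is a header line followed by the target and query sequence lines, and the function names (`read_axt_blocks`, `tally`, `tally_axt_file`) and the tiny sample block are made up for this example.

```python
import gzip
from typing import Dict, Iterable, Iterator, Tuple


def read_axt_blocks(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (target_seq, query_seq) for each three-line axt block."""
    buf = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank separators and comment lines
        buf.append(line)
        if len(buf) == 3:  # header, target sequence, query sequence
            yield buf[1], buf[2]
            buf = []


def tally(blocks: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count alignment columns, matches, mismatches, Ns and gaps."""
    stats = {"columns": 0, "matches": 0, "mismatches": 0,
             "n_columns": 0, "gap_columns": 0, "gap_events": 0}
    for target, query in blocks:
        prev_gap_side = None  # which sequence the current gap run is in
        for t, q in zip(target.upper(), query.upper()):
            stats["columns"] += 1
            if t == "-" or q == "-":
                side = "target" if t == "-" else "query"
                stats["gap_columns"] += 1
                if side != prev_gap_side:  # a new indel starts here
                    stats["gap_events"] += 1
                prev_gap_side = side
                continue
            prev_gap_side = None
            if t == "N" or q == "N":
                stats["n_columns"] += 1
            elif t == q:
                stats["matches"] += 1
            else:
                stats["mismatches"] += 1
    return stats


def tally_axt_file(path: str) -> Dict[str, int]:
    """Tally a gzipped axt file, e.g. a downloaded *.net.axt.gz."""
    with gzip.open(path, "rt") as handle:
        return tally(read_axt_blocks(handle))


# Made-up block for illustration only (not real genome data).
sample = """0 chr1 1 10 chr1 1 8 + 900
ACGTACGTAC
ACGAAC--AC
"""
stats = tally(read_axt_blocks(sample.splitlines()))
identity = stats["matches"] / (stats["matches"] + stats["mismatches"])
print(stats, identity)
```

Note that the identity printed here is computed within aligned, ungapped columns only, so it says nothing about sequence that never made it into the alignment.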

Here are the stats I calculated for each alignment:
I used the differences between the two alignments to work out how much of the net.axt alignment was due to copy number variation, and how many SNPs and insertions there seem to be within copy number variants.

This yielded the following statistics for the overall similarity between the human and chimpanzee genome:
In my view, the upper bound of 93.4% is unlikely to prove to be the true value, once we have complete assemblies for both the human and chimpanzee genomes. This is because the regions of genomes that are hardest to assemble tend to be areas that are very repetitive, or fast evolving, or both. I therefore think it is unlikely that the 4.98% of the human genome that is represented by gaps in the hg38 assembly will prove to be orthologous to the chimpanzee genome. In addition, not all regions of the human genome that currently have no alignment look as if they are highly repetitive (though some are). I think that if these non-repetitive regions were present in the chimpanzee genome they would have been successfully sequenced and assembled by now.

I welcome feedback on these calculations and the methods and assumptions behind them, and especially identification of any errors I may have made.

Best wishes,

Richard

@RichardBuggs

Dr. Buggs, what do you hope to demonstrate by having a solid percentage calculated?

Isn’t the Devil in the Detail of how many chromosomal features are shared between humans and chimps… but not as well as other branches of the Great Ape section of the Primate tree?

I don’t see how you can make that claim given the fact that the human and chimp assembled genomes are not complete. There are gaps in both genomes where they don’t have good sequence or don’t have enough data to accurately place good sequence within each genome.

Shouldn’t that be closer to 96.5%? That is the figure in the chimp genome paper for orthologous regions. Do you have a reason why this figure would change so drastically?

Are you assuming copy number variants are called correctly in both genomes?

Hi Steve,

I have quite a backlog of questions to deal with, and only half an hour free, so I will try to tick off one of the major ones.

In response to my comment:

You said

As far as I can make out from NCBI, PanTro 3 and 4 were based on 6x Sanger genome coverage. PanTro5 had an additional 55x coverage of Illumina overlapping paired 250bp length reads, 2 Lanes of a Chicago library (Hi-C from Dovetail Genomics) and 9x coverage of PacBio long single molecule reads. The total sequence length of PanTro5 is 3,231,154,112bp (ungapped length =3,132,603,083bp), whereas PanTro4 is 3,309,561,368 (2,902,353,696bp ungapped).

I think it is worth noting that although PanTro5 has considerably more data than PanTro4, and is 8% longer in its ungapped length, it has only yielded an increase of one-to-one orthology with the human genome of 1.9% (from 82.3% to 84.2%).

I can’t find anything online about PanTro6. If you have access to information about how this is an improvement on PanTro5, I would be very grateful.

Hi all,

To explore further the effect of improvement of the chimpanzee genome assembly on percentage similarity estimates for entire human and chimpanzee genomes, I have done the same calculation for the PanTro4 genome assembly (based on 6X Sanger read coverage) as I did earlier for the PanTro6 genome assembly (based on a lot more sequence data and covering more of the chimp genome - see my previous post).

Here are the two sets of stats side by side:

The improved chimpanzee genome has led to slightly greater precision in my estimates of human chimpanzee percentage similarity: the minimum is raised, and the maximum is slightly reduced.

One thing that I was not necessarily expecting is that the copy number variant (CNV) regions seem to have increased in size with the improved chimpanzee genome assembly (as shown by the “Paralogs in axtnet alignment” row). I had wondered if improved chimpanzee coverage might decrease this figure, as repeats in the chimpanzee genome became better resolved, but this does not seem to have occurred.

As ever, I welcome critiques and suggestions.

Hi all,
I would be very grateful if someone were willing to move this discussion forward by making a case, based on current data, that the entire human genome is over 94% identical to the chimpanzee genome.
Many thanks,
Richard

What if someone said, it doesn’t matter what the exact percentage is?

Since I know nothing about the quality or properties of the alignment being used, I have no idea what value could be extracted from it.

Silly question, but what is the value in coming up with a value? To a lay person once you get near 90% that is in the almost identical region.

For 2 randomly selected living humans what kind of similarity percentage could you expect?

Hi Steve,

See: Index of /goldenpath/hg38/vsPanTro6 and Index of /goldenpath/hg38/vsPanTro4

Here is the information available at the first URL above:


This directory contains alignments of the following assemblies:

  • target/reference: Human
    (hg38, Dec. 2013 (GRCh38/hg38),
    GRCh38 Genome Reference Consortium Human Reference 38 (GCA_000001405.15))

  • query: Chimp
    (panTro6, Jan. 2018 (Clint_PTRv2/panTro6),
    University of Washington)

Files included in this directory:

  • md5sum.txt: md5sum checksums for the files in this directory

  • hg38.panTro6.all.chain.gz: chained lastz alignments. The chain format is
    described in Genome Browser Chain Format .

  • hg38.panTro6.net.gz: “net” file that describes rearrangements between
    the species and the best Chimp match to any part of the
    Human genome. The net format is described in
    Genome Browser Net Format .

  • hg38.panTro6.net.axt.gz: chained and netted alignments,
    i.e. the best chains in the Human genome, with gaps in the best
    chains filled in by next-best chains where possible. The axt format is
    described in Genome Browser axt Alignment Format .

  • hg38.panTro6.synNet.maf.gz - filtered net file for syntenic alignments
    only, in MAF format, see also, description of MAF format:
    Genome Browser FAQ

  • hg38.panTro6.syn.net.gz - filtered net file for syntenic alignments only

  • reciprocalBest/ directory, contains reciprocal-best netted chains
    for hg38-panTro6

The hg38 and panTro6 assemblies were aligned by the lastz alignment
program, which is available from Webb Miller’s lab at Penn State
University (CCGB: Miller Lab). Any hg38 sequences larger
than 20,010,000 bases were split into chunks of 20,010,000 bases overlapping
by 10,000 bases for alignment. A similar process was followed for panTro6,
with chunks of 20,000,000 overlapping by 0. Following alignment, the
coordinates of the chunk alignments were corrected by the
blastz-normalizeLav script written by Scott Schwartz of Penn State.

The lastz scoring matrix (Q parameter) used was:

        A    C    G    T
  A     90 -330 -236 -356
  C   -330  100 -318 -236
  G   -236 -318  100 -330
  T   -356 -236 -330   90

with a gap open penalty of O=600 and a gap extension penalty of E=150.
The minimum score for an alignment to be kept was K=4500 for the first pass
and L=4500 for the second pass, which restricted the search space to the
regions between two alignments found in the first pass. The minimum
score for alignments to be interpolated between was H=2000. Other blastz
parameters specifically set for this species pair:
E=150
M=254
O=600
T=2
Y=15000

The .lav format lastz output was translated to the .psl format with
lavToPsl, then chained by the axtChain program.

Chain minimum score: 5000, and linearGap matrix of (medium):
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300

Chained alignments were processed into nets by the chainNet, netSyntenic,
and netClass programs.
Best-chain alignments in axt format were extracted by the netToAxt program.
All programs run after lastz were written by Jim Kent at UCSC.

References

Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence
alignments. Pac Symp Biocomput. 2002:115-26.

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D.
Evolution’s cauldron: Duplication, deletion, and rearrangement in the
mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep
30;100(20):11484-9.

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC,
Haussler D, Miller W. Human-Mouse Alignments with BLASTZ. Genome
Res. 2003 Jan;13(1):103-7.
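To make the scoring parameters quoted in the README above a little more concrete, here is a minimal Python sketch that scores a pre-aligned pair of sequences with that substitution matrix and those gap penalties. One assumption to flag loudly: I treat a gap of length n as costing O + n×E, which is my reading of the parameters rather than something the README states, and the function name `score_alignment` is invented for this example. It is not lastz itself and does no searching or chaining.

```python
# Substitution scores copied from the README's lastz matrix (rows = target base).
SCORES = {
    "A": {"A": 90, "C": -330, "G": -236, "T": -356},
    "C": {"A": -330, "C": 100, "G": -318, "T": -236},
    "G": {"A": -236, "C": -318, "G": 100, "T": -330},
    "T": {"A": -356, "C": -236, "G": -330, "T": 90},
}
GAP_OPEN = 600    # O in the README
GAP_EXTEND = 150  # E in the README


def score_alignment(target: str, query: str) -> int:
    """Score two equal-length aligned strings ('-' marks a gap).

    Assumption: a gap run of length n is charged GAP_OPEN + n * GAP_EXTEND.
    """
    if len(target) != len(query):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    prev_gap_side = None
    for t, q in zip(target.upper(), query.upper()):
        if t == "-" or q == "-":
            side = "target" if t == "-" else "query"
            if side != prev_gap_side:  # charge the open penalty once per run
                total -= GAP_OPEN
            total -= GAP_EXTEND
            prev_gap_side = side
        else:
            total += SCORES[t][q]
            prev_gap_side = None
    return total


perfect = score_alignment("ACGT", "ACGT")   # 90 + 100 + 100 + 90
one_gap = score_alignment("ACGT", "AC-T")   # 90 + 100 - (600 + 150) + 90
one_snp = score_alignment("A", "G")         # single transition
print(perfect, one_gap, one_snp)
```

On these numbers alone, with matches worth 90 to 100 each, an identical ungapped stretch would need roughly 45 to 50 bases to reach a score of 4500, and every mismatch or gap pushes that requirement up.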

I think that many of you may have similar thoughts to @T_aquaticus in response to the estimates I am giving, so I will respond as best I can.

The 96.5% figure is from within the alignment of the two genomes. To get an alignment, you have to do a search for the similar parts of the two genomes. This means that in the alignment you will only get parts of the genome that are similar. There could be parts of the genome that have been accurately sequenced but genuinely differ by over, say 25% (depending on exactly what threshold was set in making the alignments), but will be left out of the alignment.

If we want to calculate the similarity of the entire genomes, we need to take into account the possibility of such regions of the genome. If these regions happen to be large, they will lead to very different estimates than those estimates that are only based on the sequences that have been aligned. My calculations based on the most recent alignments of the human and chimpanzee genomes show that about 4% of the assembled human genome (excluding gaps in the human assembly) does not currently have an alignment to the assembled chimpanzee genome.

This figure could go down as the chimpanzee assembly is improved, as has been pointed out several times in this discussion. That is true. But last time there was a substantial improvement in the chimpanzee assembly (from PanTro4 to PanTro5), whilst 230,249,387 sequenced bases were added to the chimpanzee genome, the one-to-one orthology between the human and chimpanzee genomes only grew by 59,439,415 bases. I think this is probably to be expected, because the parts of genomes that are hard to assemble tend to be the most repetitive parts, which also tend to be the fastest evolving. Thus I do not expect the whole of the human genome to show alignment once the chimpanzee genome is complete. For similar reasons, I think it is unlikely that the 5% of the human genome assembly that is currently made up of gaps will all prove alignable to the chimpanzee genome (assuming sensible alignment thresholds).
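The arithmetic behind the PanTro4 to PanTro5 comparison can be checked in a few lines of Python. The ungapped lengths and the orthology gain are the figures quoted in this thread; the variable names are just for illustration.

```python
# Figures quoted in this thread (ungapped assembly lengths, in bases).
pantro4_ungapped = 2_902_353_696
pantro5_ungapped = 3_132_603_083
orthology_gain = 59_439_415  # growth in one-to-one orthology with human

added_bases = pantro5_ungapped - pantro4_ungapped
growth_pct = 100 * added_bases / pantro4_ungapped
yield_pct = 100 * orthology_gain / added_bases

print(f"bases added to the chimp assembly: {added_bases:,}")
print(f"growth in ungapped length: {growth_pct:.1f}%")
print(f"share of added bases that became one-to-one orthology: {yield_pct:.1f}%")
```

In other words, on these figures only about a quarter of the newly added chimpanzee sequence turned into additional one-to-one orthology with the human genome.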

However, note that in my “maximum” estimate figure in Post 40 above (i.e. 93.43% similarity between humans and chimps) I have made the unrealistic assumption that all of the regions of the human genome assembly that currently have no alignment to the chimp assembly, and all the regions that are currently gaps, are made up of sequences that have true one-to-one orthology with the chimpanzee genome.
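As a rough consistency check on how the two bounds relate, the spread between them can be split into the assembly-gap share and whatever is left over. This is only a sketch using the rounded percentages given in this thread (84.4%, 93.4% and 4.98%); it is not a reconstruction of the underlying calculation, and it assumes the spread is simply gaps plus unaligned sequence.

```python
# Rounded percentages of the whole human genome, as quoted in the opening post.
lower_bound = 84.4   # one-to-one orthology, pessimistic case
upper_bound = 93.4   # one-to-one orthology, optimistic case
assembly_gaps = 4.98 # share of hg38 made up of gaps

spread = upper_bound - lower_bound
# Assumption: whatever is not explained by gaps is assembled but unaligned.
unaligned_of_total = spread - assembly_gaps
unaligned_of_assembled = 100 * unaligned_of_total / (100 - assembly_gaps)

print(f"spread between bounds: {spread:.2f} points")
print(f"left over after gaps: {unaligned_of_total:.2f} points of the whole genome")
print(f"as a share of the assembled genome: {unaligned_of_assembled:.1f}%")
```

The leftover is close to the "about 4%" of the assembled human genome said to have no alignment to the chimpanzee assembly, which is what one would expect if the upper bound treats both gaps and unaligned regions as one-to-one orthologous.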

Also, to come back to your 96.5% figure, if I understand correctly where you are getting this figure from it does not include copy number variants. The recent comparisons I have done of the latest alignments between human and chimpanzee genome assemblies suggest that about 5% of the human genome may be copy number variants of sequences in the chimpanzee genome. This figure may change as both genomes get better resolved, but it actually increased last time there was a major improvement in the chimpanzee genome assembly, from 4.86% with PanTro4 to 5.03% with PanTro6.

Thus, I don’t think it is unreasonable to expect that once unaligned regions and copy number variants are included in a percentage similarity figure - and they have to be if we are to get a figure that summarises the entire genomes - the result will be quite different from 96.5%.

One day we will know the figure for sure. At the moment I posit that we may know enough to make a well-informed prediction.

I think it’s only fair to comment that I think this is a well-written, clear explanation. I have at times rankled at some of your comments (particularly your more tendentious summaries of Dennis’s positions), but this comment had none of that edge to it. To the contrary, I feel like I learned something, and if there are inaccuracies here, I’d find it helpful to understand them.

Now, at the same time, I come back to the question that has been posed a few times so far, which is: what exactly is the point of haggling over a few percentage points here and there, except perhaps to cast aspersions on the common descent of humans and chimpanzees or to suggest some great scientific conspiracy that’s unduly driving the similarity numbers higher?

I am curious about this sort of meta-level question of “Why are we even asking this question?”. But as for the explanation itself, I found it lucid. Thanks for posting it!

And if I might add, what do we gain by answering the question? As far as I can tell, not much of anything.

Well, yes and no (imho). I mean, science is built on people repeating the experiments of others and validating their conclusions (or not, indeed!). One of the valid critiques from within the scientific community of the scientific community these days is that repeating others’ experiments isn’t seen as “sexy” enough for CVs and so it never gets done or published. I think there is value in double-checking science – any scientific findings – for precision. Now, in the wider context with the pattern of comments here on the BioLogos Forum, and placed within a sociological setting of science skepticism, it does make one wonder what the real point is. But I think the double-checking itself could be valuable. (Just my very uninformed opinion.)

NOTE: I am not an expert on genome sequencing, so you or others should feel free to correct any errors that I make. I am hoping @glipsnort will smack me upside the head and point out my mistakes.

From my limited understanding you have two things: the assembly and the alignment.

The genome assembly is the result of taking the small portions that were sequenced and connecting them into a longer sequence. No assembly is complete, so there will be gaps where it wasn’t clear which short sequences fit into that gap. This could be due to bad sequencing runs or an inability to determine where a specific short sequence fits into the larger sequence.

The alignment takes the human and chimp assembled sequences and puts orthologous sections side by side. Since there are gaps in each assembly there will be no sequence in one assembly where there is good sequence in the other genome. This doesn’t necessarily mean that there is a gap (i.e. indel) in one assembly. It simply means that they don’t know what the sequence of the DNA is in that region.

You seem to count these gaps in the assembly as missing DNA, and then count it as 0% similarity. That seems wrong to me. It isn’t a gap in the genome. It is a gap in the assembly. It is wrong to assume that the chimp and human genomes are 0% similar where there is a lack of sequence information. As those gaps are filled in I would strongly suspect that they would have the same average similarity as the other sections of the genomes where we do have sequence in both assemblies.

Isn’t it entirely possible that the 4% of the human genome that does not currently have an ortholog in the chimp genome is simply DNA sequence that has yet to be added to the chimp genome assembly?

Also, what happens when you use the same technique to compare the chimp and gorilla genomes, or the chimp and orangutan genomes? What about comparing two human genomes? After all, the prediction made by the theory of evolution is that we should see a phylogeny.

Pedantry is one of the highest art forms in the sciences. :wink:

The percentage difference would only matter if it were so large that the known mutation rate, generation time, and time since divergence could not produce those differences. From what I have read, this isn’t a problem.

The larger effort from the ID/creationist side seems to be obfuscation. The plan is to create as much doubt as possible, even if the possible errors they are pointing to are completely irrelevant to the larger picture.

This is, of course, how it appears to most of us. But I keep hoping that perhaps Richard will show us his cards and answer these sorts of questions himself, rather than content himself with letting us come up with our own explanations (which are often none too flattering). For better or worse, though, that doesn’t seem to be his style… unless I’m missing something (which is of course altogether possible given the speed at which I have skimmed much of these exchanges)…

I am inclined to ask the same question of the entire “human genome and Adam” controversy that has taken so much comment here - what is the point of experimental data to the teachings of the Christian faith that God created the first true human couple? The only response that appears comprehensible to me at least (and I am certain I am in the minority??) is to indulge in YEC and ID “bashing” (as we say down under).

I suppose the other side of the coin is claiming certainty when there is none. The entire argument rests on an assumption that a given difference in the comparison can lead, with certainty, to a given conclusion. This is either quantifiable and generally true for all species, or (within your cosy tent of certainty :laughing: ) it is an assumption, and the need for such an assumption should be discussed and questioned by scientists - after all, that is the scientific method :wink:.