Human Chimp Genome Similarity


#30

First and foremost, I think it would be dependent on the phylogenetic signal in the genetic and morphological data.

One could argue that too large a gap between two genomes could not be explained by the accumulation of random mutations over a relatively short period of time (5-10 million years for chimp/human), but that certainly isn’t the case here. It then comes down to the relative differences between species and the consistency of the data with the tree predicted by the theory of evolution. Phylogenetic signal is still the most basic and powerful piece of evidence for the theory.

You also mention repetitive elements, and they fit right into this paradigm. One great example is the LTRs found in endogenous retroviruses.

"Third, sequence divergence between the LTRs at the ends of a given provirus provides an important and unique source of phylogenetic information. The LTRs are created during reverse transcription to regenerate cis-acting elements required for integration and transcription. Because of the mechanism of reverse transcription, the two LTRs must be identical at the time of integration, even if they differed in the precursor provirus (Fig. 1A). Over time, they will diverge in sequence because of substitutions, insertions, and deletions acquired during cellular DNA replication."
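As a rough illustration of the divergence the quoted passage describes, here is a minimal Python sketch that counts substitutions between the two LTRs of one provirus. The function name and the toy sequences are invented for illustration, the inputs are assumed to be pre-aligned, and indel columns are simply skipped rather than scored.

```python
def ltr_divergence(ltr5: str, ltr3: str) -> float:
    """Fraction of aligned, ungapped sites at which the two LTRs differ."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTRs must be pre-aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a == "-" or b == "-":
            continue  # skip indel columns; only substitutions are counted
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy sequences, invented for illustration only.
at_integration = ltr_divergence("ACGTACGTAC", "ACGTACGTAC")  # identical LTRs
after_one_change = ltr_divergence("ACGTACGTAC", "ACGAACGTAC")  # one substitution
```

Identical LTRs give a divergence of zero, matching the state at integration, and the value rises as substitutions accumulate.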

When you plug these repetitive elements into algorithms that construct trees you get the expected species tree with very few exceptions. You can check them out in this figure from the paper linked above.


(Dennis Venema) #31

The goal in attempting to reduce the % difference has always been an attempt to cast doubt on common ancestry.

What I have noticed with YEC and anti-common descent ID is that there is seldom (if ever?) an attempt to make sense of the entire sweep of the data supporting common ancestry. We see attempts to diminish the % identity. We see attempts to show that a few rare pseudogenes have been exapted and now have a new function. We see aspersions cast on incomplete lineage sorting, and so on - but we don’t see a coherent case that explains the data better from an antievolutionary perspective. We don’t even see articles that tackle these diverse lines of evidence in the same article, lest the readership notice the problems, I guess.

The one counterexample I know is Todd Wood’s 2006 article, and he concludes that YECs don’t have a good explanation. That’s it.


(Jon) #32

Exactly. When we’re dealing with people who say things like this, the aim is clear.

  • “We are not attacking the teaching of Darwinian theory, we are just saying that criticisms of Darwin’s theory should also be taught” (Teach The Controversy strategy)

  • “But the fact is that we are still unable even to guess Darwinian pathways for the origin of most complex biological structures” (argument from ignorance)

  • “More recently, proponents of ID predicted that some “junk” DNA must have a function well before this view became mainstream among Darwinists.” (note “proponents of ID” versus “Darwinists”)

  • “In fact, ID is a logical inference, based on data gathered from the natural world, and hence it is firmly in the realm of science.” (ID is science)

  • “I do not know of a good evolutionary pathway for the development of the bacterial flagellum. In his latest book, Professor Richard Dawkins identifies a single possible intermediate step. This hardly constitutes a pathway.” (argument for irreducible complexity)

In summary:

The arguments are presented in the style of an educational film, and are generally presented among needlessly lengthy scientific descriptions and impressive visuals, which help to make creationist arguments sound reasonable to anyone without scientific training in the relevant disciplines. Anyone familiar with creationists will recognize their standard tactics including appeals to emotion, argument from ignorance, misdirection and occasionally blatant falsehoods. – Science, Just Science, October 2006.


#33

YEC and ID have always been about creating the thinnest of scientific veneers so that people could feel justified in rejecting evolution. Explaining the evidence has never been their intent.


(Steve Schaffner) #34

Since we have already identified 1.5% of the human genome as being unique to our species, of course I agree.

It’s not an analysis I’m interested in.

How much additional sequencing has been done of the chimpanzee genome?


(Richard Buggs) #35

Hi all,

I have done some more calculations for you, seeking to come up with an upper bound of similarity between the human and chimpanzee genomes, as well as a lower bound. This is based on the most recent alignments between human and chimpanzee genomes, available at the UCSC genomics website (hg38 versus PanTro6); see http://hgdownload.cse.ucsc.edu/goldenpath/hg38/vsPanTro6/

By my calculations, the human genome has between 84.4% and 93.4% one-to-one orthology with the chimpanzee genome. The uncertainty makes allowance for possible current incompleteness in our knowledge of both the human and chimpanzee genomes. The upper bound of 93.4% assumes that all the regions of the human and chimpanzee genomes that we have not yet assembled and/or aligned will prove to have one-to-one orthology between humans and chimps. The lower bound of 84.4% assumes that these regions will prove to be different between humans and chimpanzees. I have assumed throughout that further sequencing is unlikely to significantly alter currently known differences due to SNPs, indels, and copy number variation.
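The relationship between the two bounds can be checked with a little arithmetic. The Python sketch below assumes the simplest reading of the paragraph above, namely that the upper bound is the lower bound plus everything currently unknown (assembly gaps plus unaligned sequence). The variable names are my own; the 4.98% gap figure is the one quoted further down in this post.

```python
# Figures quoted in the post, as percentages of the whole human genome.
lower_pct = 84.4   # unknown regions assumed NOT orthologous
upper_pct = 93.4   # unknown regions assumed fully one-to-one orthologous
gap_pct = 4.98     # hg38 assembly gaps

# Total "unknown" share implied by the width of the interval.
unknown_pct = upper_pct - lower_pct

# Whatever is not assembly gap must be assembled-but-unaligned sequence
# (an inference from the stated bounds, not a figure given in the post).
implied_unaligned_pct = unknown_pct - gap_pct

print(f"unknown: {unknown_pct:.2f}%  implied unaligned: {implied_unaligned_pct:.2f}%")
```

The implied unaligned share comes out close to the "about 4%" mentioned later in the thread, although that later figure is stated relative to the assembled genome rather than the whole genome.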

Here is how I did the calculation. I downloaded both the reciprocal best (“rbest”) alignment and the “net” alignment (which allows copy number variation) from UCSC for hg38 and PanTro6. Using custom Perl scripts, I measured the length of each alignment, and the number of SNPs and insertions in each one. I looked up the “Total assembly gap length” in the hg38 human genome assembly statistics online (only a negligible number of these gaps were present as Ns in the alignments). These gap lengths are only known rather approximately, and many are estimated to the nearest 1000 or 10,000 bases.
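The original Perl scripts are not shown, so the following is only a Python sketch of the kind of tally described: walking through an axt file and counting aligned columns, mismatches, and gap columns. It assumes the standard axt layout (a summary line, the target sequence, the query sequence, then a blank line). The function names and the toy block are invented for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_axt_blocks(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (target_seq, query_seq) pairs from axt-formatted lines."""
    buf = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank separators and comment lines
        buf.append(line)
        if len(buf) == 3:  # summary line, target sequence, query sequence
            yield buf[1], buf[2]
            buf = []


def tally(blocks: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count matches, mismatches, gap columns and N columns over all blocks."""
    stats = {"columns": 0, "matches": 0, "mismatches": 0,
             "gap_columns": 0, "n_columns": 0}
    for target, query in blocks:
        for a, b in zip(target.upper(), query.upper()):
            stats["columns"] += 1
            if a == "-" or b == "-":
                stats["gap_columns"] += 1
            elif a == "N" or b == "N":
                stats["n_columns"] += 1  # unknown base: neither match nor SNP
            elif a == b:
                stats["matches"] += 1
            else:
                stats["mismatches"] += 1
    return stats


# Toy block, invented for illustration only.
example = """\
0 chr1 11 20 chr1 101 109 + 900
ACGTACGTAC
ACGAAC-TAC

""".splitlines()

stats = tally(read_axt_blocks(example))
print(stats)
```

For a real run, the same functions could be fed the lines of a decompressed alignment file such as hg38.panTro6.net.axt.gz, for instance via `gzip.open(path, "rt")`. Whether the author counted gap columns or gap events, and how Ns were treated, is not stated in the post.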

Here are the stats I calculated for each alignment:
[Table not preserved: statistics for the rbest and net alignments]

I used the differences between the two alignments to work out how much of the net.axt alignment was due to copy number variation, and how many SNPs and insertions there seem to be within copy number variants.

This yielded the following statistics for the overall similarity between the human and chimpanzee genome:
[Table not preserved: overall human-chimpanzee similarity statistics]

In my view, the upper bound of 93.4% is unlikely to prove to be the true value, once we have complete assemblies for both the human and chimpanzee genomes. This is because the regions of genomes that are hardest to assemble tend to be areas that are very repetitive, or fast evolving, or both. I therefore think it is unlikely that the 4.98% of the human genome that is represented by gaps in the hg38 assembly will prove to be orthologous to the chimpanzee genome. In addition, not all regions of the human genome that currently have no alignment look as if they are highly repetitive (though some are). I think that if these non-repetitive regions were present in the chimpanzee genome they would have been successfully sequenced and assembled by now.

I welcome feedback on these calculations and the methods and assumptions behind them, and especially identification of any errors I may have made.

Best wishes,

Richard


(George Brooks) #36

@RichardBuggs

Dr. Buggs, what do you hope to demonstrate by having a solid percentage calculated?

Isn’t the devil in the detail of how many chromosomal features are shared between humans and chimps… but not shared as fully with other branches of the Great Ape section of the primate tree?


#37

I don’t see how you can make that claim given the fact that the human and chimp assembled genomes are not complete. There are gaps in both genomes where they don’t have good sequence or don’t have enough data to accurately place good sequence within each genome.

Shouldn’t that be closer to 96.5%? That is the figure in the chimp genome paper for orthologous regions. Do you have a reason why this figure would change so drastically?


(Steve Schaffner) #38

Are you assuming copy number variants are called correctly in both genomes?


(Richard Buggs) #39

Hi Steve,

I have quite a backlog of questions to deal with, and only half an hour free, so I will try to tick off one of the major ones.

In response to my comment:

You said

As far as I can make out from NCBI, PanTro 3 and 4 were based on 6x Sanger genome coverage. PanTro5 had an additional 55x coverage of Illumina overlapping paired 250bp length reads, 2 Lanes of a Chicago library (Hi-C from Dovetail Genomics) and 9x coverage of PacBio long single molecule reads. The total sequence length of PanTro5 is 3,231,154,112bp (ungapped length =3,132,603,083bp), whereas PanTro4 is 3,309,561,368 (2,902,353,696bp ungapped).

I think it is worth noting that although PanTro5 has considerably more data than PanTro4, and is 8% longer in its ungapped length, it has only yielded an increase of one-to-one orthology with the human genome of 1.9% (from 82.3% to 84.2%).
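The assembly lengths quoted above can be sanity-checked directly. This short Python sketch uses only the four lengths given in the post; the variable names are my own.

```python
# Assembly lengths in base pairs, as quoted in the post.
pantro4_total, pantro4_ungapped = 3_309_561_368, 2_902_353_696
pantro5_total, pantro5_ungapped = 3_231_154_112, 3_132_603_083

# Sequenced (ungapped) bases added between the two assemblies.
added_bases = pantro5_ungapped - pantro4_ungapped
growth_pct = 100 * added_bases / pantro4_ungapped

# Gap content of each assembly (total length minus ungapped length).
pantro4_gaps = pantro4_total - pantro4_ungapped
pantro5_gaps = pantro5_total - pantro5_ungapped

print(f"added bases: {added_bases:,} ({growth_pct:.2f}% growth)")
print(f"gaps: PanTro4 {pantro4_gaps:,}  PanTro5 {pantro5_gaps:,}")
```

The growth in ungapped length comes out just under 8%, in line with the rounded figure in the post, and the gap content drops from roughly 407 million bases to roughly 99 million.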

I can’t find anything online about PanTro6. If you have access to information about how this is an improvement on PanTro5, I would be very grateful.


(Richard Buggs) #40

Hi all,

To explore further the effect of improvement of the chimpanzee genome assembly on percentage similarity estimates for entire human and chimpanzee genomes, I have done the same calculation for the PanTro4 genome assembly (based on 6X Sanger read coverage) as I did earlier for the PanTro6 genome assembly (based on a lot more sequence data and covering more of the chimp genome - see my previous post).

Here are the two sets of stats side by side:

The improved chimpanzee genome has led to slightly greater precision in my estimates of human chimpanzee percentage similarity: the minimum is raised, and the maximum is slightly reduced.

One thing that I was not necessarily expecting is that the copy number variant (CNV) regions seem to have increased in size with the improved chimpanzee genome assembly (as shown by the “Paralogs in axtnet alignment” row). I had wondered if improved chimpanzee coverage might decrease this figure, as repeats in the chimpanzee became better resolved, but this does not seem to have occurred.

As ever, I welcome critiques and suggestions.


(Richard Buggs) #41

Hi all,
I would be very grateful if someone was willing to move this discussion forward by making a case, based on current data, that the entire human genome is over 94% identical to the chimpanzee genome.
Many thanks,
Richard


(George Brooks) #42

What if someone said, it doesn’t matter what the exact percentage is?


(Steve Schaffner) #43

Since I know nothing about the quality or properties of the alignment being used, I have no idea what value could be extracted from it.


#44

Silly question, but what is the value in coming up with a value? To a lay person once you get near 90% that is in the almost identical region.

For 2 randomly selected living humans what kind of similarity percentage could you expect?


(Richard Buggs) #45

Hi Steve,

See: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/vsPanTro6/ and http://hgdownload.cse.ucsc.edu/goldenpath/hg38/vsPanTro4/

Here is the information available at the first URL above:


This directory contains alignments of the following assemblies:

  • target/reference: Human
    (hg38, Dec. 2013 (GRCh38/hg38),
    GRCh38 Genome Reference Consortium Human Reference 38 (GCA_000001405.15))

  • query: Chimp
    (panTro6, Jan. 2018 (Clint_PTRv2/panTro6),
    University of Washington)

Files included in this directory:

  • md5sum.txt: md5sum checksums for the files in this directory

  • hg38.panTro6.all.chain.gz: chained lastz alignments. The chain format is
    described in http://genome.ucsc.edu/goldenPath/help/chain.html .

  • hg38.panTro6.net.gz: “net” file that describes rearrangements between
    the species and the best Chimp match to any part of the
    Human genome. The net format is described in
    http://genome.ucsc.edu/goldenPath/help/net.html .

  • hg38.panTro6.net.axt.gz: chained and netted alignments,
    i.e. the best chains in the Human genome, with gaps in the best
    chains filled in by next-best chains where possible. The axt format is
    described in http://genome.ucsc.edu/goldenPath/help/axt.html .

  • hg38.panTro6.synNet.maf.gz - filtered net file for syntenic alignments
    only, in MAF format, see also, description of MAF format:
    http://genome.ucsc.edu/FAQ/FAQformat.html#format5

  • hg38.panTro6.syn.net.gz - filtered net file for syntenic alignments only

  • reciprocalBest/ directory, contains reciprocal-best netted chains
    for hg38-panTro6

The hg38 and panTro6 assemblies were aligned by the lastz alignment
program, which is available from Webb Miller’s lab at Penn State
University (http://www.bx.psu.edu/miller_lab/). Any hg38 sequences larger
than 20,010,000 bases were split into chunks of 20,010,000 bases overlapping
by 10,000 bases for alignment. A similar process was followed for panTro6,
with chunks of 20,000,000 overlapping by 0. Following alignment, the
coordinates of the chunk alignments were corrected by the
blastz-normalizeLav script written by Scott Schwartz of Penn State.

The lastz scoring matrix (Q parameter) used was:

        A    C    G    T
  A     90 -330 -236 -356
  C   -330  100 -318 -236
  G   -236 -318  100 -330
  T   -356 -236 -330   90

with a gap open penalty of O=600 and a gap extension penalty of E=150.
The minimum score for an alignment to be kept was K=4500 for the first pass
and L=4500 for the second pass, which restricted the search space to the
regions between two alignments found in the first pass. The minimum
score for alignments to be interpolated between was H=2000. Other blastz
parameters specifically set for this species pair:
E=150
M=254
O=600
T=2
Y=15000

The .lav format lastz output was translated to the .psl format with
lavToPsl, then chained by the axtChain program.

Chain minimum score: 5000, and linearGap matrix of (medium):
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300

Chained alignments were processed into nets by the chainNet, netSyntenic,
and netClass programs.
Best-chain alignments in axt format were extracted by the netToAxt program.
All programs run after lastz were written by Jim Kent at UCSC.

References

Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence
alignments. Pac Symp Biocomput. 2002:115-26.

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D.
Evolution’s cauldron: Duplication, deletion, and rearrangement in the
mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep
30;100(20):11484-9.

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC,
Haussler D, Miller W. Human-Mouse Alignments with BLASTZ. Genome
Res. 2003 Jan;13(1):103-7.
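To make the scoring parameters in the README above more concrete, here is a minimal Python sketch that scores a short pre-aligned pair of sequences with the quoted substitution matrix and gap penalties. It is not lastz itself. It assumes that a gap of length n costs O + n×E, and it treats any run of adjacent gap columns as a single gap, which is a simplification. The function name and test sequences are invented for illustration.

```python
BASES = "ACGT"

# Substitution matrix (Q parameter) as quoted in the README; rows and columns A, C, G, T.
MATRIX = [
    [90, -330, -236, -356],
    [-330, 100, -318, -236],
    [-236, -318, 100, -330],
    [-356, -236, -330, 90],
]
GAP_OPEN = 600     # O
GAP_EXTEND = 150   # E


def score_alignment(target: str, query: str) -> int:
    """Score two pre-aligned sequences of equal length ('-' marks a gap)."""
    if len(target) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    score = 0
    in_gap = False
    for a, b in zip(target.upper(), query.upper()):
        if a == "-" or b == "-":
            if not in_gap:
                score -= GAP_OPEN  # charged once per run of gap columns
                in_gap = True
            score -= GAP_EXTEND    # charged for every gap column (assumed convention)
        else:
            in_gap = False
            score += MATRIX[BASES.index(a)][BASES.index(b)]
    return score


perfect = score_alignment("ACGT", "ACGT")
one_gap = score_alignment("ACGT", "AC-T")
print(perfect, one_gap)
```

With these parameters a single one-base gap costs as much as roughly eight matching bases earn, which gives a feel for how strongly the aligner penalises indels relative to matches.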


(Richard Buggs) #46

I think that many of you may have similar thoughts to @T_aquaticus in response to the estimates I am giving, so I will respond as best I can.

The 96.5% figure is from within the alignment of the two genomes. To get an alignment, you have to do a search for the similar parts of the two genomes. This means that in the alignment you will only get parts of the genome that are similar. There could be parts of the genome that have been accurately sequenced but genuinely differ by over, say, 25% (depending on exactly what threshold was set in making the alignments), and these will be left out of the alignment.

If we want to calculate the similarity of the entire genomes, we need to take into account the possibility of such regions of the genome. If these regions happen to be large, they will lead to very different estimates than those estimates that are only based on the sequences that have been aligned. My calculations based on the most recent alignments of the human and chimpanzee genomes show that about 4% of the assembled human genome (excluding gaps in the human assembly) does not currently have an alignment to the assembled chimpanzee genome.

This figure could go down as the chimpanzee assembly is improved, as has been pointed out several times in this discussion. That is true. But last time there was a substantial improvement in the chimpanzee assembly (from PanTro4 to PanTro5), whilst 230,249,387 sequenced bases were added to the chimpanzee genome, the one-to-one orthology between the human and chimpanzee genomes only grew by 59,439,415 bases. I think this is probably to be expected, because the parts of genomes that are hard to assemble tend to be the most repetitive parts, which also tend to be the fastest evolving. Thus I do not expect the whole of the human genome to show alignment once the chimpanzee genome is complete. For similar reasons, I think it is unlikely that the 5% of the human genome assembly that is currently made up of gaps will all prove alignable to the chimpanzee genome (assuming sensible alignment thresholds).
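Put as a ratio, the two base counts in the paragraph above give the share of newly sequenced chimpanzee bases that ended up in one-to-one orthology. The short Python sketch below uses only those two quoted figures; the variable names are my own.

```python
# Figures quoted in the paragraph above (PanTro4 -> PanTro5).
bases_added = 230_249_387       # newly sequenced chimpanzee bases
orthology_gained = 59_439_415   # growth in one-to-one orthology with human

# Share of the new sequence that became one-to-one orthologous.
yield_fraction = orthology_gained / bases_added
print(f"{100 * yield_fraction:.1f}% of the added bases became one-to-one orthologous")
```

That is roughly a quarter of the added sequence.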

However, note that in my “maximum” estimate figure in Post 40 above (i.e. 93.43% similarity between humans and chimps) I have made the unrealistic assumption that all of the regions of the human genome assembly that currently have no alignment to the chimp assembly, and all the regions that are currently gaps, are made up of sequences that have true one-to-one orthology with the chimpanzee genome.

Also, to come back to your 96.5% figure, if I understand correctly where you are getting this figure from it does not include copy number variants. The recent comparisons I have done of the latest alignments between human and chimpanzee genome assemblies suggest that about 5% of the human genome may be copy number variants of sequences in the chimpanzee genome. This figure may change as both genomes get better resolved, but it actually increased last time there was a major improvement in the chimpanzee genome assembly, from 4.86% with PanTro4 to 5.03% with PanTro6.

Thus, I don’t think it is unreasonable to expect that once unaligned regions and copy number variants are included in a percentage similarity figure - and they have to be if we are to get a figure that summarises the entire genomes - the result will be quite different from 96.5%.

One day we will know the figure for sure. At the moment I posit that we may know enough to make a well-informed prediction.


(A.M. Wolfe) #47

I think it’s only fair to comment that I think this is a well-written, clear explanation. I have at times rankled at some of your comments (particularly your more tendentious summaries of Dennis’s positions), but this comment had none of that edge to it. To the contrary, I feel like I learned something, and if there are inaccuracies here, I’d find it helpful to understand them.

Now, at the same time, I come back to the question that has been posed a few times so far, which is, what exactly is the point behind haggling over a few percentage points here and there, except perhaps to cast aspersions on the common descent of humans and chimpanzees or to suggest some great scientific conspiracy that’s unduly driving the similarity numbers higher.

I am curious about this sort of meta-level question of “Why are we even asking this question?”. But as for the explanation itself, I found it lucid. Thanks for posting it!


#48

And if I might add, what do we gain by answering the question? As far as I can tell, not much of anything.


(A.M. Wolfe) #49

Well, yes and no (imho). I mean, science is built on people repeating the experiments of others and validating their conclusions (or not, indeed!). One of the valid critiques from within the scientific community of the scientific community these days is that repeating others’ experiments isn’t seen as “sexy” enough for CVs and so it never gets done or published. I think there is value in double-checking science – any scientific findings – for precision. Now, in the wider context with the pattern of comments here on the BioLogos Forum, and placed within a sociological setting of science skepticism, it does make one wonder what the real point is. But I think the double-checking itself could be valuable. (Just my very uninformed opinion.)