Human Chimp Genome Similarity

T_aquaticus · April 24, 2018, 10:09pm

But how can you say that there is not orthology when parts of the human and chimp genome are not part of that comparison? That seems to be the error you made from the very start.

To use an analogy, we have two incomplete jigsaw puzzles. One puzzle has 90% of the pieces put together, and the other puzzle also has 90% of the puzzle pieces put together. The problem is that it is not the same 90%. There are sections where there are pieces in one puzzle but not in the other. It is easy to see that pieces are nearly the same in the same positions, and the picture on each piece differs by 1.5% where there are pieces in both puzzles at the same position.

Now, would it be correct to say that the sections where there are not pieces in both puzzles are 0% identical? Of course not, right? You need the actual pieces in place before you can compare them. More to the point, why would you ever think that when you start filling out the remaining 10% in each puzzle that the pieces won’t be 1.5% different just like the rest of the puzzle pieces?
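The jigsaw analogy can be made concrete with a small sketch. It computes identity only over positions that are present in both sequences, and reports how many positions were actually compared. The function name `identity_over_shared`, the use of `N` for a missing piece, and the toy sequences are assumptions made for illustration; they are not taken from any real alignment.

```python
def identity_over_shared(seq_a, seq_b, missing="N"):
    """Return (identity, compared) using only positions present in both sequences.

    Positions where either sequence has the `missing` symbol are skipped,
    because there is nothing to compare there (they are not counted as 0% identical).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (already aligned)")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == missing or b == missing:
            continue  # a piece is absent from one puzzle: skip it
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return None, 0  # no shared positions, so identity is undefined
    return matches / compared, compared


# Toy example: each "puzzle" is missing a different 20% of its pieces.
puzzle_a = "ACGTACGTNN"
puzzle_b = "NNGTACCTAC"
identity, compared = identity_over_shared(puzzle_a, puzzle_b)
print(identity, compared)  # 5 of the 6 shared positions match
```

The point of returning `compared` alongside the identity is that a percentage is only meaningful together with the amount of sequence it was measured over.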

TGLarkin · April 24, 2018, 10:54pm

Thank you all for the great discussion. No percentage of commonality would ever prove or disprove a common ancestor. Wouldn’t this depend more on the DNA sequence itself, such as the shared locations of pseudogenes (nonfunctional genes), ancient repetitive elements (AREs), etc.?

Joel_Duff · April 25, 2018, 1:24am

All very interesting discussion, but in the end the numbers (70%, 86%, 98%, 98.6%) don’t mean much without some sort of context. As T_aquaticus has said above, we need to know what we are comparing. But I would go a step further. Even if we agree on what we are comparing, what do the values we get at the end really tell us?
What I’ve been asking for (especially from YECs who tout the 70% or 85%) is an apples-to-apples pairwise comparison. If a chimp and human are 85% similar using a particular algorithm, why can’t that person use that same algorithm to tell us what the difference is between a Neanderthal and a human, or a Han Chinese and an African Bushman? I believe Joshua Swamidass has produced a similar comparison of mouse and rat, and they are far more different using the same criteria than a human and chimp. We need some numbers to compare against to make sense of 86%. That sounds like a substantial difference (and that is always the point of the anti-evolutionist), but what if you and I are only 94% similar by the same measure? Are we going to start saying that I am not human but you are?

gbrooks9 · April 25, 2018, 3:43am

I’m glad that some participants are noting that this “70%… 98%” topic is not exactly the most persuasive lever of opinion on the matter of evolution.

But what is persuasive is being able to show how specific broken genes are shared between chimps and humans … and virtually no other animal groups.

T_aquaticus · April 25, 2018, 3:53pm

First and foremost, I think it would be dependent on the phylogenetic signal in the genetic and morphological data.

One could argue that too large a gap between two genomes could not be explained by the accumulation of random mutations over a relatively short period of time (5-10 million years for chimp/human), but that certainly isn’t the case here. It then comes down to the relative differences between species and the consistency of the data with the tree predicted by the theory of evolution. Phylogenetic signal is still the most basic and powerful piece of evidence for the theory.

You also mention repetitive elements, and they fit right into this paradigm. One great example is the LTRs found in endogenous retroviruses.

“Third, sequence divergence between the LTRs at the ends of a given provirus provides an important and unique source of phylogenetic information. The LTRs are created during reverse transcription to regenerate cis-acting elements required for integration and transcription. Because of the mechanism of reverse transcription, the two LTRs must be identical at the time of integration, even if they differed in the precursor provirus (Fig. 1A). Over time, they will diverge in sequence because of substitutions, insertions, and deletions acquired during cellular DNA replication.”
http://www.pnas.org/content/96/18/10254

When you plug these repetitive elements into algorithms that construct trees you get the expected species tree with very few exceptions. You can check them out in this figure from the paper linked above.

DennisVenema · April 25, 2018, 4:13pm

The goal in attempting to reduce the % difference has always been an attempt to cast doubt on common ancestry.

What I have noticed with YEC and anti-common descent ID is that there is seldom (if ever?) an attempt to make sense of the entire sweep of the data supporting common ancestry. We see attempts to diminish the % identity. We see attempts to show that a few rare pseudogenes have been exapted and now have a new function. We see aspersions cast on incomplete lineage sorting, and so on - but we don’t see a coherent case that explains the data better from an antievolutionary perspective. We don’t even see articles that tackle these diverse lines of evidence in the same article, lest the readership notice the problems, I guess.

The one counterexample I know is Todd Wood’s 2006 article, and he concludes that YECs don’t have a good explanation. That’s it.

Jonathan_Burke · April 25, 2018, 6:20pm

Exactly. When we’re dealing with people who say things like this, the aim is clear.

“We are not attacking the teaching of Darwinian theory, we are just saying that criticisms of Darwin’s theory should also be taught” (Teach The Controversy strategy)
“But the fact is that we are still unable even to guess Darwinian pathways for the origin of most complex biological structures” (argument from ignorance)
“More recently, proponents of ID predicted that some “junk” DNA must have a function well before this view became mainstream among Darwinists.” (note “proponents of ID” versus “Darwinists”)
“In fact, ID is a logical inference, based on data gathered from the natural world, and hence it is firmly in the realm of science.” (ID is science)
“I do not know of a good evolutionary pathway for the development of the bacterial flagellum. In his latest book, Professor Richard Dawkins identifies a single possible intermediate step. This hardly constitutes a pathway.” (argument for irreducible complexity)

In summary:

The arguments are presented in the style of an educational film, and are generally presented among needlessly lengthy scientific descriptions and impressive visuals, which help to make creationist arguments sound reasonable to anyone without scientific training in the relevant disciplines. Anyone familiar with creationists will recognize their standard tactics including appeals to emotion, argument from ignorance, misdirection and occasionally blatant falsehoods, – Science, Just Science, October 2006.

T_aquaticus · April 25, 2018, 8:43pm

YEC and ID have always been about creating the thinnest of scientific veneers so that people could feel justified in rejecting evolution. Explaining the evidence has never been their intent.

glipsnort · April 28, 2018, 5:23pm

Since we have already identified 1.5% of the human genome as being unique to our species, of course I agree.

It’s not an analysis I’m interested in.

How much additional sequencing has been done of the chimpanzee genome?

RichardBuggs · May 1, 2018, 8:28am

Hi all,

I have done some more calculations for you to seek to come up with an upper bound of similarity between human and chimpanzee genomes, as well as a lower bound. This is based on the most recent alignments between human and chimpanzee genomes, available at the UCSC genomics website (hg38 versus PanTro6) see Index of /goldenpath/hg38/vsPanTro6

By my calculations, the human genome has between 84.4% and 93.4% one-to-one orthology with the chimpanzee genome. The uncertainty makes allowance for possible current incompleteness in our knowledge of both the human and chimpanzee genomes. The upper bound of 93.4% assumes that all the regions of the human and chimpanzee genomes that we have not yet assembled and/or aligned will prove to have one-to-one orthology between humans and chimps. The lower bound of 84.4% assumes that these regions will prove to be different between humans and chimpanzees. I have assumed throughout that further sequencing is unlikely to significantly alter currently known differences due to SNPs, indels, and copy number variation.
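The logic of the two bounds can be sketched as follows. This is only a simplified illustration of the reasoning described above, not the actual calculation: the function name `similarity_bounds`, its parameters, and the toy numbers are all hypothetical, and the real analysis also accounts for SNPs, indels, and copy number variation separately.

```python
def similarity_bounds(genome_len, identical_one_to_one, unresolved):
    """Return (lower, upper) bounds on genome-wide similarity.

    genome_len            total length of the reference genome, in bases
    identical_one_to_one  bases known to match in one-to-one alignments
    unresolved            bases not yet assembled and/or aligned

    Lower bound: assume every unresolved base turns out to differ.
    Upper bound: assume every unresolved base turns out to match one-to-one.
    """
    if identical_one_to_one + unresolved > genome_len:
        raise ValueError("known and unresolved bases exceed the genome length")
    lower = identical_one_to_one / genome_len
    upper = (identical_one_to_one + unresolved) / genome_len
    return lower, upper


# Toy numbers only (NOT the values from the real hg38/PanTro6 alignments).
lower, upper = similarity_bounds(genome_len=1000,
                                 identical_one_to_one=850,
                                 unresolved=90)
print(f"{lower:.1%} to {upper:.1%}")
```

Bases that are known to differ (here the remaining 60 of 1000) lower both bounds equally; only the unresolved fraction creates the spread between them.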

Here is how I did the calculation. I downloaded both the reciprocal best (“rbest”) alignment and the “net” alignment (which allows copy number variation) from UCSC for hg38 and PanTro6. Using custom Perl scripts, I measured the length of each alignment, and the number of SNPs and insertions in each one. I looked up the “Total assembly gap length” in the hg38 human genome assembly statistics online (only a negligible number of these gap bases were present as Ns in the alignments). The gap lengths are known only approximately, and many are estimated to the nearest 1000 or 10,000 bases.
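For readers who want to try a similar tally, here is a minimal Python sketch of counting aligned columns, mismatches, and gap bases in an axt file. It is not the Perl script used for the figures in this post. It assumes the standard axt layout (a nine-field summary line, then the target sequence line, then the query sequence line, then a blank line); the function name `axt_stats`, the decision to skip columns containing `N`, and the toy input are assumptions.

```python
import io


def axt_stats(handle):
    """Tally simple statistics over all blocks of an axt alignment."""
    stats = {"blocks": 0, "matches": 0, "mismatches": 0,
             "target_gap_bases": 0, "query_gap_bases": 0}
    # Drop blank lines and '#' comment lines; what remains comes in threes.
    lines = (ln.strip() for ln in handle)
    lines = (ln for ln in lines if ln and not ln.startswith("#"))
    for header in lines:
        target = next(lines, None)
        query = next(lines, None)
        if target is None or query is None or len(target) != len(query):
            raise ValueError(f"malformed block after header: {header}")
        stats["blocks"] += 1
        for t, q in zip(target.upper(), query.upper()):
            if t == "-":
                stats["target_gap_bases"] += 1   # base present only in query
            elif q == "-":
                stats["query_gap_bases"] += 1    # base present only in target
            elif t == "N" or q == "N":
                continue                          # unknown base: not scored
            elif t == q:
                stats["matches"] += 1
            else:
                stats["mismatches"] += 1          # single-base difference
    return stats


# Toy two-block input, invented for illustration.
toy = io.StringIO(
    "0 chr1 1 10 chr1 1 9 + 900\n"
    "ACGTACGTAC\n"
    "ACGTAC-TAC\n"
    "\n"
    "1 chr1 21 28 chr1 20 28 + 800\n"
    "ACGT-CGTA\n"
    "ACGTACGAA\n"
)
print(axt_stats(toy))
```

For the real files one would open the download with `gzip.open(path, "rt")` and pass the handle to the same function. Note that this counts gap bases, not gap events; counting insertions as events would need an extra check for where each run of `-` starts.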

Here are the stats I calculated for each alignment:

I used the differences between the two alignments to work out how much of the net.axt alignment was due to copy number variation, and how many SNPs and insertions there seem to be within copy number variants.

This yielded the following statistics for the overall similarity between the human and chimpanzee genome:

In my view, the upper bound of 93.4% is unlikely to prove to be the true value, once we have complete assemblies for both the human and chimpanzee genomes. This is because the regions of genomes that are hardest to assemble tend to be areas that are very repetitive, or fast evolving, or both. I therefore think it is unlikely that the 4.98% of the human genome that is represented by gaps in the hg38 assembly will prove to be orthologous to the chimpanzee genome. In addition, not all regions of the human genome that currently have no alignment look as if they are highly repetitive (though some are). I think that if these non-repetitive regions were present in the chimpanzee genome they would have been successfully sequenced and assembled by now.

I welcome feedback on these calculations and the methods and assumptions behind them, and especially identification of any errors I may have made.

Best wishes,

Richard

gbrooks9 · May 1, 2018, 2:13pm

@RichardBuggs

Dr. Buggs, what do you hope to demonstrate by having a solid percentage calculated?

Isn’t the Devil in the Detail of how many chromosomal features are shared between humans and chimps… but not as well as other branches of the Great Ape section of the Primate tree?

T_aquaticus · May 1, 2018, 3:30pm

I don’t see how you can make that claim given the fact that the human and chimp assembled genomes are not complete. There are gaps in both genomes where they don’t have good sequence or don’t have enough data to accurately place good sequence within each genome.

Shouldn’t that be closer to 96.5%? That is the figure in the chimp genome paper for orthologous regions. Do you have a reason why this figure would change so drastically?

glipsnort · May 1, 2018, 3:33pm

Are you assuming copy number variants are called correctly in both genomes?

RichardBuggs · May 2, 2018, 9:00pm

Hi Steve,

I have quite a backlog of questions to deal with, and only half an hour free, so I will try to tick off one of the major ones.

In response to my comment:

You said

As far as I can make out from NCBI, PanTro 3 and 4 were based on 6x Sanger genome coverage. PanTro5 had an additional 55x coverage of Illumina overlapping paired 250bp length reads, 2 Lanes of a Chicago library (Hi-C from Dovetail Genomics) and 9x coverage of PacBio long single molecule reads. The total sequence length of PanTro5 is 3,231,154,112bp (ungapped length =3,132,603,083bp), whereas PanTro4 is 3,309,561,368 (2,902,353,696bp ungapped).

I think it is worth noting that although PanTro5 has considerably more data than PanTro4, and is 8% longer in its ungapped length, it has only yielded an increase in one-to-one orthology with the human genome of 1.9 percentage points (from 82.3% to 84.2%).
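The assembly arithmetic can be checked directly from the lengths quoted above. This is just a worked calculation; the dictionary layout and function names are my own.

```python
# Total and ungapped lengths (in bp) as quoted above for each assembly.
ASSEMBLIES = {
    "PanTro4": {"total": 3_309_561_368, "ungapped": 2_902_353_696},
    "PanTro5": {"total": 3_231_154_112, "ungapped": 3_132_603_083},
}


def gap_length(name):
    """Bases of the assembly that are gaps (total minus ungapped length)."""
    a = ASSEMBLIES[name]
    return a["total"] - a["ungapped"]


def percent_increase(old, new):
    """Relative increase from old to new, as a percentage of old."""
    return 100 * (new - old) / old


growth = percent_increase(ASSEMBLIES["PanTro4"]["ungapped"],
                          ASSEMBLIES["PanTro5"]["ungapped"])
print(f"ungapped length increase: {growth:.1f}%")   # about 7.9%
print(f"PanTro4 gaps: {gap_length('PanTro4'):,} bp")
print(f"PanTro5 gaps: {gap_length('PanTro5'):,} bp")
```

So the ungapped sequence grew by roughly 230 million bases, while the gap length fell from about 407 million to about 99 million bases.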

I can’t find anything online about PanTro6. If you have access to information about how this is an improvement on PanTro5, I would be very grateful.

RichardBuggs · May 3, 2018, 8:42pm

Hi all,

To explore further the effect of improvement of the chimpanzee genome assembly on percentage similarity estimates for entire human and chimpanzee genomes, I have done the same calculation for the PanTro4 genome assembly (based on 6X Sanger read coverage) as I did earlier for the PanTro6 genome assembly (based on a lot more sequence data and covering more of the chimp genome - see my previous post).

Here are the two sets of stats side by side:

The improved chimpanzee genome has led to slightly greater precision in my estimates of human chimpanzee percentage similarity: the minimum is raised, and the maximum is slightly reduced.

One thing that I was not necessarily expecting is that the copy number variant (CNV) regions seem to have increased in size with the improved chimpanzee genome assembly (as shown by the “Paralogs in axtnet alignment” row). I had wondered if improved chimpanzee coverage might decrease this figure, as repeats in the chimpanzee became better resolved, but this does not seem to have occurred.

As ever, I welcome critiques and suggestions.

RichardBuggs · May 8, 2018, 9:51pm

Hi all,
I would be very grateful if someone was willing to move this discussion forward by making a case, based on current data, that the entire human genome is over 94% identical to the chimpanzee genome.
many thanks,
Richard

gbrooks9 · May 8, 2018, 10:16pm

What if someone said, it doesn’t matter what the exact percentage is?

glipsnort · May 8, 2018, 10:55pm

Since I know nothing about the quality or properties of the alignment being used, I have no idea what value could be extracted from it.

Bill_II · May 9, 2018, 1:06am

Silly question, but what is the value in coming up with a value? To a lay person once you get near 90% that is in the almost identical region.

For 2 randomly selected living humans what kind of similarity percentage could you expect?

RichardBuggs · May 9, 2018, 9:42am

Hi Steve,

See: Index of /goldenpath/hg38/vsPanTro6 and Index of /goldenpath/hg38/vsPanTro4

Here is the information available at the first URL above:

This directory contains alignments of the following assemblies:

target/reference: Human
(hg38, Dec. 2013 (GRCh38/hg38),
GRCh38 Genome Reference Consortium Human Reference 38 (GCA_000001405.15))
query: Chimp
(panTro6, Jan. 2018 (Clint_PTRv2/panTro6),
University of Washington)

Files included in this directory:

md5sum.txt: md5sum checksums for the files in this directory
hg38.panTro6.all.chain.gz: chained lastz alignments. The chain format is
described in Genome Browser Chain Format.
hg38.panTro6.net.gz: “net” file that describes rearrangements between
the species and the best Chimp match to any part of the
Human genome. The net format is described in
Genome Browser Net Format.
hg38.panTro6.net.axt.gz: chained and netted alignments,
i.e. the best chains in the Human genome, with gaps in the best
chains filled in by next-best chains where possible. The axt format is
described in Genome Browser axt Alignment Format.
hg38.panTro6.synNet.maf.gz: filtered net file for syntenic alignments
only, in MAF format. See also the description of MAF format:
Genome Browser FAQ
hg38.panTro6.syn.net.gz: filtered net file for syntenic alignments only
reciprocalBest/ directory: contains reciprocal-best netted chains
for hg38-panTro6

The hg38 and panTro6 assemblies were aligned by the lastz alignment
program, which is available from Webb Miller’s lab at Penn State
University (CCGB: Miller Lab). Any hg38 sequences larger
than 20,010,000 bases were split into chunks of 20,010,000 bases overlapping
by 10,000 bases for alignment. A similar process was followed for panTro6,
with chunks of 20,000,000 overlapping by 0. Following alignment, the
coordinates of the chunk alignments were corrected by the
blastz-normalizeLav script written by Scott Schwartz of Penn State.
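The chunking described above can be sketched as follows. This is only an illustration of the stated sizes (20,010,000-base chunks overlapping by 10,000 bases for hg38; 20,000,000-base chunks with no overlap for panTro6). The function name `chunk_ranges` and the handling of the final short chunk are assumptions, not taken from the UCSC pipeline.

```python
def chunk_ranges(seq_len, chunk=20_010_000, overlap=10_000):
    """Return half-open (start, end) ranges covering a sequence of seq_len bases.

    Sequences no longer than `chunk` are left whole. Longer sequences are cut
    into windows of `chunk` bases, each starting `chunk - overlap` bases after
    the previous one, so neighbouring windows share `overlap` bases.
    """
    if seq_len <= chunk:
        return [(0, seq_len)]
    step = chunk - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + chunk, seq_len)
        ranges.append((start, end))
        if end == seq_len:
            break
        start += step
    return ranges


# A hypothetical 50 Mb sequence, chunked with the hg38 and panTro6 settings.
print(chunk_ranges(50_000_000))                      # overlapping chunks
print(chunk_ranges(50_000_000, 20_000_000, 0))       # non-overlapping chunks
```

Because each chunk is aligned on its own, the resulting coordinates are relative to the chunk start, which is why a later normalization step is needed to shift them back to chromosome coordinates.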

The lastz scoring matrix (Q parameter) used was:

        A    C    G    T
  A     90 -330 -236 -356
  C   -330  100 -318 -236
  G   -236 -318  100 -330
  T   -356 -236 -330   90

with a gap open penalty of O=600 and a gap extension penalty of E=150.
The minimum score for an alignment to be kept was K=4500 for the first pass
and L=4500 for the second pass, which restricted the search space to the
regions between two alignments found in the first pass. The minimum
score for alignments to be interpolated between was H=2000. Other blastz
parameters specifically set for this species pair:
E=150
M=254
O=600
T=2
Y=15000
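To show how the substitution matrix and gap penalties above combine, here is a minimal sketch that scores an already-aligned pair of sequences. It is not lastz itself. It assumes that a gap of length n costs O + n×E (that is, the first gapped base pays both the open and the extend penalty), which is my understanding of the lastz convention; the function name `score_alignment` is hypothetical.

```python
# Substitution scores from the matrix quoted above (row = target, column = query).
SCORES = {
    "A": {"A": 90, "C": -330, "G": -236, "T": -356},
    "C": {"A": -330, "C": 100, "G": -318, "T": -236},
    "G": {"A": -236, "C": -318, "G": 100, "T": -330},
    "T": {"A": -356, "C": -236, "G": -330, "T": 90},
}
GAP_OPEN = 600     # O
GAP_EXTEND = 150   # E


def score_alignment(target, query):
    """Score two equal-length aligned strings that use '-' for gaps."""
    if len(target) != len(query):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_side = None  # None, "target", or "query": which side the open gap is on
    for t, q in zip(target.upper(), query.upper()):
        if t == "-" or q == "-":
            side = "target" if t == "-" else "query"
            if side != gap_side:
                score -= GAP_OPEN      # a new gap starts here
            score -= GAP_EXTEND        # every gapped column pays the extension
            gap_side = side
        else:
            score += SCORES[t][q]
            gap_side = None
    return score


print(score_alignment("ACGT", "ACGT"))  # 380: four matches
print(score_alignment("AAGT", "ACGT"))  # -50: one A/C mismatch
print(score_alignment("ACGT", "AC-T"))  # -470: one single-base gap
```

With these values a single mismatch or a one-base gap wipes out the score of several matching bases, so only long, mostly matching stretches reach the K=4500 threshold.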

The .lav format lastz output was translated to the .psl format with
lavToPsl, then chained by the axtChain program.

Chain minimum score: 5000, and linearGap matrix of (medium):
tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 118300 218300 318300
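One plausible reading of the linearGap table above is that it gives the gap cost at each listed gap length, with costs for intermediate lengths obtained by linear interpolation. The sketch below implements that reading for the qGap row only. The interpolation rule, the extrapolation beyond the last entry, and the function name `interpolated_cost` are assumptions on my part and should be checked against the axtChain source before relying on them.

```python
from bisect import bisect_right

# "position" and "qGap" rows copied from the table above.
POSITION = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
Q_GAP = [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900]


def interpolated_cost(gap_len, positions=POSITION, costs=Q_GAP):
    """Gap cost for gap_len bases, interpolating linearly between table entries."""
    if gap_len < 1:
        raise ValueError("gap length must be at least 1")
    if gap_len >= positions[-1]:
        # Assumed: extend the slope of the last table segment.
        slope = (costs[-1] - costs[-2]) / (positions[-1] - positions[-2])
        return costs[-1] + slope * (gap_len - positions[-1])
    i = bisect_right(positions, gap_len) - 1   # last entry at or below gap_len
    x0, x1 = positions[i], positions[i + 1]
    y0, y1 = costs[i], costs[i + 1]
    return y0 + (y1 - y0) * (gap_len - x0) / (x1 - x0)


for n in (1, 7, 11, 61):
    print(n, interpolated_cost(n))
```

Under this reading the per-base cost of a gap falls steeply as the gap gets longer (350 for one base, 900 for 111 bases), which lets chains span large insertions and deletions without being broken.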

Chained alignments were processed into nets by the chainNet, netSyntenic,
and netClass programs.
Best-chain alignments in axt format were extracted by the netToAxt program.
All programs run after lastz were written by Jim Kent at UCSC.

References

Chiaromonte F, Yap VB, Miller W. Scoring pairwise genomic sequence
alignments. Pac Symp Biocomput. 2002:115-26.

Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D.
Evolution’s cauldron: Duplication, deletion, and rearrangement in the
mouse and human genomes. Proc Natl Acad Sci U S A. 2003 Sep
30;100(20):11484-9.

Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC,
Haussler D, Miller W. Human-Mouse Alignments with BLASTZ. Genome
Res. 2003 Jan;13(1):103-7.