Human and Chimp Similarity

Swamidass · July 24, 2017, 4:41am

Continuing the discussion from Do Evolutionary Theory And Scripture Contradict Each Other?:

While we won’t know what the chimp genome really looks like until more accurate research is done, I recently did a study of the chimp reads that have lower levels of human DNA contamination, and in this newer study the chimp DNA is only 85% similar to human at best, not 98%.

Tomkins, The Untold Story Behind DNA Similarity | Answers in Genesis

There are many ways to do it. Some are better than others. The exact number isn’t important. The key thing is to lay the number alongside controls (see Scientific control - Wikipedia). It is the relationship between the computed similarity and the controls that allows us to interpret it.

Tomkins’s proposal is reasonable. He suggests measuring similarity between chimp “reads” (short fragments of raw data) and the human assembled genome, to eliminate bias introduced by using the human genome to scaffold the chimp genome. That is a good idea, clever really. Let’s use it.

So Tomkins takes chimp reads, and computes the average similarity between these reads and the reference genome. He computes about 85% (chimp read → human genome). I agree that is what he got, and I get the same number as him. But how much is because of error in the reads? There are no controls, so we do not really know.

85% = (chimp read → human genome)

How do we solve this? We add controls. Let’s try a few. We could add some more data to our analysis. Let’s say we look at the chimp genome too. Here is approximately what we get…

87% = (chimp read → chimp genome)

Hmm. That isn’t right. No chimp is that different from their genome. We expect something closer to 100%. We can try another control. How about adding human reads, and seeing what we see there.

89% = (human read → human genome)
87% = (human read → chimp genome)

Hmm, so we see the same problem here. All humans are less than 0.5% different, so something is clearly going wrong. What is going on?

It turns out this identifies a big problem, one that would be obvious to those who sequence genomes. There is a lot of random error in reads (it is raw data after all). The error in the final genome is a lot lower, because the errors in individual reads cancel out. The error in the reads, however, artificially lowers the similarity computed against the genome.
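A toy simulation can make the effect concrete. This is only a sketch under simplifying assumptions: the "genome" is random, errors are substitutions only, reads are compared at their true position, and the 10% error rate is a hypothetical value chosen to make the drop visible. It is not a model of any real sequencing platform or of the BLAST pipeline.

```python
import random

def mutate(seq, error_rate, rng):
    """Return a copy of seq where each base is miscalled with probability error_rate."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def identity(a, b):
    """Percent of positions at which two equal-length sequences agree."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(100_000))

# Hypothetical settings, for illustration only.
read_len, error_rate = 100, 0.10
starts = [rng.randrange(len(genome) - read_len) for _ in range(500)]
reads = [genome[s:s + read_len] for s in starts]

clean = [identity(r, r) for r in reads]                            # error-free reads
noisy = [identity(r, mutate(r, error_rate, rng)) for r in reads]   # reads with errors

mean_clean = sum(clean) / len(clean)
mean_noisy = sum(noisy) / len(noisy)
print(f"error-free reads vs own genome: {mean_clean:.1f}%")
print(f"noisy reads vs own genome:      {mean_noisy:.1f}%")
```

Reads taken from the very same genome score well below 100% once errors are added, which is the pattern the same-species controls show.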

Is there a way to fix this? Yes there is! We can subtract out the amount of error in the data, by bringing chimp and human reads close to 100% when measured against their own genomes.

There are two ways to compute the similarity between humans and chimps this way…

98% = 100 - (human read → human genome) + (human read → chimp genome)
98% = 100 - (chimp read → chimp genome) + (chimp read → human genome)

We can do this over and over again, for every individual (human or chimp) for which we have data. We have a lot of data like this, and the percent difference comes out, by Tomkins’s method, to be about 2% different or 98% the same, when you take the controls into account to subtract out the sequencing error.
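The correction is simple enough to write down as a function. This is a minimal sketch using the approximate, rounded percentages quoted above; the function and parameter names are mine, for illustration only.

```python
def corrected_similarity(self_identity, cross_identity):
    """Similarity after subtracting the loss seen in the same-species control.

    self_identity:  % identity of reads against their own species' genome
    cross_identity: % identity of the same reads against the other genome
    """
    return 100 - self_identity + cross_identity

# Approximate, rounded percentages from the post.
from_human_reads = corrected_similarity(self_identity=89, cross_identity=87)
from_chimp_reads = corrected_similarity(self_identity=87, cross_identity=85)
print(from_human_reads, from_chimp_reads)  # 98 98
```

Both directions land on the same figure because the control and the cross-species comparison use the same reads, so whatever error those reads carry is subtracted out.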

But what does 2% or 98% mean anyway? How do we interpret that? Controls to the rescue. Let’s take mice and rats, animals most YECs think are of the same kind. “Microevolution” (to borrow their term) can account for the differences here. We can measure this the same way we measured the difference between human and chimp. It is critical to measure the same way, so we can compare the numbers. We get approximately…

82% (mice - rats)

And that is compared with…

98% (human - chimp)

In evolutionary theory, there is a mathematical theory that explains this strange result. We can predict that there will be about 10x more differences between mice-rat than human-chimp (18% vs 2%), just as we see in the data. In the YEC world, this is clear evidence that human and chimp genomes look like they are the same kind. Maybe God made us separate, but disproving evolution was not one of his design goals.
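For anyone who wants to check the ratio, here is the arithmetic as a short sketch, using the rounded similarity figures above (the function name is just for illustration):

```python
def fold_difference(similarity_a, similarity_b):
    """How many times more different pair A is than pair B, from % similarities."""
    return (100 - similarity_a) / (100 - similarity_b)

mouse_rat, human_chimp = 82, 98  # approximate corrected similarities (%)
ratio = fold_difference(mouse_rat, human_chimp)
print(ratio)  # 9.0, i.e. roughly an order of magnitude
```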

Of course, if you do not like my correction to Tomkins’s figures, you could always just look at the mice read to rat genome numbers. The uncorrected number is about…

70% = (mice read → rat genome)
vs.
85% = (chimp read → human genome)

Which is clearly below 85%, leaving us with the same interpretation. Humans and chimps are more similar than mice and rats. This is explained by the mathematical formulas of evolution, but is strange in YEC. At the very least, it tells us that God is not nearly as concerned about disproving evolution as we are.

NOTE: The numbers here are approximate, rounded off for clarity of text.

T_aquaticus · July 24, 2017, 10:25pm

Tomkins makes some rather bad mistakes in his comparison, however. The most glaring error is that he used an ungapped comparison.

"BLASTN algorithm parameters for the main study were as follows: -word_size 11, -evalue 10, -max_target_seqs 1, -dust no, -soft_masking false, -ungapped. "

Why does this matter? Let’s use a short random DNA sequence to show why this is a problem:

ungapped comparison: 16/23 are the same for 70% identity

ACGGTGTACGTACCGACGGGACT
ACGGTGTACGTACCACGGGACT

gapped comparison: 22/23 are the same for 96% identity

ACGGTGTACGTACCGACGGGACT
ACGGTGTACGTACC-ACGGGACT
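The two comparisons can be reproduced with a column-by-column count. This is a minimal sketch, not BLASTN itself: it simply lays one sequence under the other and counts matching columns, treating a gap or an overhang as a mismatch. On the sequences as printed, that count comes to 16 of 23 columns ungapped and 22 of 23 gapped.

```python
from itertools import zip_longest

def percent_identity(top, bottom):
    """Count matching columns; a gap ('-') or an overhang never counts as a match."""
    columns = list(zip_longest(top, bottom, fillvalue="-"))
    matches = sum(a == b and a != "-" for a, b in columns)
    return matches, len(columns), 100.0 * matches / len(columns)

seq1 = "ACGGTGTACGTACCGACGGGACT"
ungapped = "ACGGTGTACGTACCACGGGACT"   # second sequence laid directly under the first
gapped = "ACGGTGTACGTACC-ACGGGACT"    # same sequence with one gap inserted

for label, seq2 in (("ungapped", ungapped), ("gapped", gapped)):
    matches, total, pct = percent_identity(seq1, seq2)
    print(f"{label}: {matches}/{total} = {pct:.0f}%")
```

A single missing base shifts everything downstream out of register in the ungapped case, so one real difference is scored as many.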

People have taken Tomkins’s method and changed it from ungapped to gapped and arrived at an overall similarity of 96.9% when counting each base of an indel as a separate mutation. If you count a 50 bp indel as a single mutation, then you get the expected 98% similarity using Tomkins’s method with a gapped analysis.
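To see how much the indel-counting convention matters, here is a toy calculation. The region size, substitution count, and indel are hypothetical numbers picked for illustration, not data from Tomkins or from the linked analysis, and the choice of denominator under each convention is my own assumption.

```python
def similarity(substitutions, indel_lengths, aligned_bases, per_base=True):
    """Percent similarity under two indel-counting conventions.

    per_base=True  : every base inside an indel counts as its own difference
    per_base=False : each indel counts as one difference, whatever its length
    """
    if per_base:
        differences = substitutions + sum(indel_lengths)
        columns = aligned_bases + sum(indel_lengths)
    else:
        differences = substitutions + len(indel_lengths)
        columns = aligned_bases + len(indel_lengths)
    return 100.0 * (1 - differences / columns)

# Hypothetical toy region: 950 aligned bases, 12 substitutions, one 50 bp indel.
per_base = similarity(12, [50], 950, per_base=True)
per_event = similarity(12, [50], 950, per_base=False)
print(f"per base: {per_base:.1f}%  per event: {per_event:.1f}%")
```

The same alignment gives a noticeably lower figure when a single long insertion or deletion is scored as dozens of separate mutations.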

https://uncommondescent.com/intelligent-design/human-and-chimp-dna-they-really-are-about-98-similar/

This is a case of a creationist purposefully misusing a tool to get the wrong answer.

Swamidass · July 26, 2017, 2:01am

Sorry, but that is from an earlier study. This one does use the gapped parameter. You can look it up here…

As I said, I’m very familiar with this work.

Swamidass · July 26, 2017, 2:07am

@J.E.S Did you get a chance to see this? I did my best to explain.

pevaquark · July 26, 2017, 2:29am

In general, I am not sure what anti-evolutionists really want or expect? I know their main goal is to analyze the data to show a lower percentage. But how low must it go (and please anti-evolutionists, publish your results in a real scientific journal) for them to disprove common descent? Or alternatively, how high is just too high?

Would 90% be too high? Or 75% be too high? How about 50% being too high? Surely that could still indicate common descent and not common design? This is a generic question to anybody who has a good answer.

pevaquark · July 26, 2017, 2:52am

I went ahead and deleted the phrase for clarity. Also do you have a reference to the paper which illustrates or shows the mathematical prediction between humans/chimps vs. rats/mice?

Swamidass · July 26, 2017, 3:03am

http://biologos.org/blogs/guest/cancer-and-evolution footnote 5

Evidence and Evolution footnote 2

https://uncommondescent.com/intelligent-design/in-defense-of-swamidass/

So the neutral molecular clock ticks twice as fast for rats and mice as it does for primates. Multiply that by the three-fold difference between the 18-million-year-old mouse-rat divergence date estimated by evolutionists and the 6-million-year-old human-chimp divergence date, and you get an expected level of genetic divergence which is just six times greater – and not two orders of magnitude (or 100 times) greater, as calculated by Dr. Hunter. This figure of a six-fold difference comports well with the ten-fold genetic divergence reported by Professor Swamidass in footnote 2 of his article: at least 15% of the codons in rats and mice are different, compared with less than 1.5% in humans and chimps.
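The back-of-the-envelope calculation in that passage, as a short sketch. All inputs are the round figures quoted there; the variable names are mine.

```python
# Figures as quoted in the passage above.
clock_rate_ratio = 2          # rodent neutral clock vs primate clock
mouse_rat_split_myr = 18      # estimated mouse-rat divergence, millions of years
human_chimp_split_myr = 6     # estimated human-chimp divergence, millions of years

expected_fold = clock_rate_ratio * (mouse_rat_split_myr / human_chimp_split_myr)

# Codon differences quoted from footnote 2: at least 15% vs less than 1.5%.
observed_fold = 15 / 1.5

print(expected_fold, observed_fold)  # 6.0 10.0
```

The expected and observed values are within a factor of two of each other, nowhere near the 100-fold gap the passage is rebutting.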

benkirk · July 26, 2017, 3:03am

It’s about nested hierarchies, not percentages.

pevaquark · July 26, 2017, 3:28am

Oh I agree, I was just having flashbacks to reading articles in my YEC days where they always sought to lower the percentage and once they mathematically proved it wasn’t nearly as high as scientists believed the only logical conclusion in their articles was that God just made us 6,000 years ago without any reasoning as to why 80% was really any better than 98%.

Even though I was already convinced, the percentage argument for common descent is quite impressive though based upon the well explained results of @Swamidass. Thanks for sharing!

T_aquaticus · July 26, 2017, 8:27pm

You mean this one?

“Overall, the basic statistics for the 101 data sets as a whole were as follows: The average alignment identify was 96.3% with an average length of 677 bases and 27 bases on average not aligning.”

96.3% is very close to what is reported in the literature.

Also, Tomkins gets this part very wrong:

“It is noteworthy that the alignable DNA similarity of about 96% omits sequences that are too dissimilar to align onto human and thus inflates the actual overall genome similarity between chimpanzees and human.”

That is not what is meant by sequence that is not aligned. In order to align DNA you need overlapping reads so that you know where in the chromosome it belongs. For unaligned DNA there could very well be a strong match between the two genomes, but because they don’t know where these pieces of DNA belong within each genome they leave them out of the comparison.

benkirk · July 26, 2017, 9:35pm

OK, I just wanted to know if you understood that.

Have you ever seen a YEC article that deals with–or even mentions–twin nested hierarchies?

Swamidass · July 26, 2017, 11:16pm

Walter ReMine’s Biotic Message sure mentions nested hierarchies. That is his whole point.

He correctly points out that perfect nested hierarchies are not predicted by evolution. Then he constructs a theology that guesses that there is a reason God made everything with perfect nested hierarchies, including to disprove evolution. Based on his observation of nested hierarchies, he concludes that evolution is false.

Of course, in nature, we do not see perfect nested hierarchies, but nested hierarchies + some noise, just as we expect from evolution. So it was a nice idea, but it fails because of the evidence. Before you jump on him though, he should get some credit for proposing a testable model. Though he loses some credit for not realizing his model was falsified.

The interesting thing about ReMine is that he argues the exact opposite case of everyone else on this point.

benkirk · July 27, 2017, 2:33pm

By what criterion/criteria do you consider ReMine to be a YEC?

T_aquaticus · July 27, 2017, 2:49pm

The signal to noise ratio for the phylogenetic signal is so well known and so heavily written about that he should be heavily criticized. You would literally have to ignore tens of thousands of peer reviewed papers from a quick Google Scholar search.

pevaquark · July 27, 2017, 3:19pm

Wow, ‘phylogentic signal noise ratio’ gives me over 37,000 hits. What kind of person would ignore that kind of work? Well from one book review, the kind of person who says: “Remine does not only state that life accidentally looks unlike evolution, but that the designer intentionally designed the living world to look unlike evolution.” Yeah… okay buddy. At least he proposed a real model I suppose.

T_aquaticus · July 27, 2017, 3:24pm

I have often wondered if there is some filter on Google I am unaware of that prevents creationists from using it. For example, creationist after creationist will claim that there are no transitional fossils, yet a quick Google search turns up thousands and thousands of hits detailing known transitional fossils.

pevaquark · July 27, 2017, 4:01pm

Yes this filter exists when reading research papers:

Like this paper is absolute proof that God created the universe at time t=0:

Yet a nice summary is that:

Or this paper which can be read with ID goggles:

Or this paper which is read with Literal, Supernaturally Created Adam and Eve Goggles:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1766376/

cwhenderson · July 27, 2017, 6:58pm

I have read Dr. Craig referring to this sheep study before. Regardless of the reasonable explanations behind it, isn’t it interesting to note how the data is presented?

“The average individual heterozygosity observed (HO) in the population in 2003 was 0.48±0.11 s.d., which is 1.8–4 times that expected by Motro and Thomson’s model.”

Somehow, the range “1.8-4 times that expected” is mysteriously truncated to “4 times that expected”. Sadly, I have seen too many examples of misrepresentation of facts and figures to conclude this omission was accidental.

pevaquark · July 27, 2017, 8:05pm

No way…! I didn’t even look at the paper but that’s a huge difference. Also, technically Dr. Rana* not Dr. Craig for the article I cited at least.

The Rana article is so painful… he concludes the section on sheep mentioning that

if these same models were used to estimate the effective sizes of the ancestral population from the measured genetic diversity at any point in time, they would have overestimated the original population size as much larger than two individuals.

The model in the sheep paper is the Motro and Thomson model. This model is not mentioned at all in any of the human population estimates. And there’s a good reason why… any real biologists, correct me if I’m wrong. The sheep paper actually says the Motro and Thomson model applies only to isolated populations assuming no mutations.

cwhenderson · July 27, 2017, 8:35pm

I don’t know who wrote the article Craig referenced, but he was using the example quite liberally.

I’m honestly not very familiar with the model they used, but yes, it certainly looks from the article that they used a “no new mutation” model that would work under Hardy-Weinberg equilibrium conditions for a few generations, but not well for long periods of time. That was a good catch!