Phylogeny vs Similarity and Function

Swamidass · January 8, 2017, 9:24pm

At a genetic level we see a great deal of similarity between organisms. This similarity is a fundamental feature of life that any theory has to explain.

Evolution explains this similarity as largely caused by shared history, by way of common descent. We see function often aligning in sensible ways on phylogenetic trees of sequences and organisms. Of course, this is not perfect, because we also know that life changes over time. As the YEC Walter ReMine accurately points out in The Biotic Message, function will not always follow a nested clade pattern in an evolutionary model. (He argued that YEC would always follow the nested clade pattern, but it turns out that biology does not always do so.)

Anti-evolutionists in the ID and creationist camps explain this similarity as caused by common function engineered by a common designer. The thought here is that any shared function we see aligning to phylogenetic trees is just an illusion. Trees are constructed by similarity measurements, and these measurements are shaped by shared function. Nothing is gained by constructing a false history through the phylogeny, and we would do better to just see similarity as caused by shared function, shared design.

So there is a field of bioinformatics that directly tests these two hypotheses for the immensely practical problem of determining the function of unknown sequences. The experiment is straightforward. We compare the accuracy of function prediction algorithms that use pairwise similarity (e.g. BLAST searches) to assign function with those that use phylogeny instead (e.g. SIFTER).
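The comparison described above can be pictured as a small evaluation harness. This is only a sketch under assumptions: the leave-one-out protocol, the toy sequences, and the two stand-in predictors (`nearest_neighbour`, `always_kinase`) are hypothetical illustrations, not the benchmark or metrics used in the published studies.

```python
def leave_one_out_accuracy(predict, labelled):
    """Hide each sequence's function in turn and check whether `predict` recovers it.

    labelled: dict name -> (sequence, function)
    predict:  callable(query_sequence, remaining_records) -> predicted function
    """
    correct = 0
    for name, (seq, truth) in labelled.items():
        rest = {n: rec for n, rec in labelled.items() if n != name}
        if predict(seq, rest) == truth:
            correct += 1
    return correct / len(labelled)


def mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour(query, records):
    """Hypothetical similarity-style predictor: copy the closest sequence's function."""
    best = min(records.values(), key=lambda rec: mismatches(query, rec[0]))
    return best[1]


def always_kinase(query, records):
    """Hypothetical trivial predictor that ignores the query entirely."""
    return "kinase"


# Toy, made-up data: two "kinase" sequences and two "transporter" sequences.
labelled = {
    "a": ("AAAA", "kinase"),
    "b": ("AAAT", "kinase"),
    "c": ("CCCC", "transporter"),
    "d": ("CCCG", "transporter"),
}

acc_nn = leave_one_out_accuracy(nearest_neighbour, labelled)
acc_trivial = leave_one_out_accuracy(always_kinase, labelled)
print(acc_nn, acc_trivial)
```

Any two prediction methods can be plugged in as `predict`, which is what makes the comparison a head-to-head test.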

We find that phylogeny-informed predictions are two times better than similarity-informed predictions. This is a fairly direct test of the two hypotheses. If phylogenies are an illusion that does not actually represent a true history, why do they predict function better than similarity? I would say this is evidence that phylogenies are more than just an illusion. They capture something about function that is more informative than just similarity, even though they are inferred from similarity.

So similarity is used to infer phylogenetic history, and this history is more correlated with function than similarity itself.

You can read some of the references (and the figures are great too) here:

And follow me (@swamidass) and one of the early pioneers in the field (@phylogenomics) on twitter for an entertaining dialogue on this.

Chris_Falter · January 9, 2017, 4:13am

Hi Joshua -

I appreciate the nice summary of this work. Would you mind answering some questions from the data science perspective?

(1) I wonder how you address the problem of leakage. Here’s how the problem can be stated:

(a) Phylogeny is based on shared function.

(b) Therefore phylogeny contains embedded information about shared function.

(c) Thus it is no surprise that phylogeny-based predictions are very accurate. Shared function is being used to predict shared function. The labels being predicted have leaked into the inputs.

You could avoid this problem, I suppose, if the shared functions being predicted have not been used in any way to build the phylogenies. Erecting that barrier between phylogeny and predicted function might be very difficult, however. Shared function has been used to build phylogenies for a very long time.

Your thoughts?

(2) What does the naive prediction represent, and how is it modeled?

Thanks!

Chris Falter

Jonathan_Burke · January 9, 2017, 4:38am

The closer ID is examined, the closer it is to YEC in the sense that both must explain why the universe presents an appearance of reality which is (according to YEC and ID) actually false. For YEC this manifests in light, rocks, and fossils with only a “false appearance” of age, and for ID this manifests in organisms with only a “false appearance” of evolution.

Swamidass · January 9, 2017, 6:32am

Absolutely. You have given some good questions that are really important to clarify.

First, the easy question…

You find this explained in the paper:

Here, naïve is a weighted random prediction in which every protein was ‘predicted’ to have functions proportional to the relative frequency of the terms in Swiss-Prot.

It is just a baseline method you really hope your smarter method can do better than. It is an example of a control benchmark that ensures the methods being developed are working better than random.
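One reading of that naïve baseline, as a minimal sketch: every protein receives the same scores, equal to how often each term occurs in the annotation database, and the query sequence is never inspected. The toy annotations and function names below are hypothetical, and the paper's exact weighting may differ.

```python
from collections import Counter


def term_frequencies(annotations):
    """Fraction of annotated proteins that carry each function term."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    total = len(annotations)
    return {term: count / total for term, count in counts.items()}


def naive_scores(freqs, query_sequence):
    """Baseline prediction: the same frequency-based scores for every query.

    The query sequence is deliberately ignored; that is what makes it naive.
    """
    return dict(freqs)


# Hypothetical toy annotation database (stand-in for Swiss-Prot).
annotations = {
    "P1": {"kinase"},
    "P2": {"kinase", "ATP binding"},
    "P3": {"transporter"},
    "P4": {"kinase"},
}

freqs = term_frequencies(annotations)
scores_a = naive_scores(freqs, "MKTAYIAK")
scores_b = naive_scores(freqs, "GGGGGGGG")
print(scores_a)
```

Any method that actually looks at the sequence should beat these scores; if it does not, it is not learning anything.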

Now the more important one…

So I understand your objection, and agree that this is a severe problem when hand constructed phylogenies of organisms are used as “proof” of evolution. Because function is used to construct the phylogeny, it is not at all surprising that it predicts function. This is too circular to be considered unbiased evidence.

However, and this is very important, in the experiments I linked to, function is NOT used in any way to construct the phylogeny. Here, instead of using hand-constructed phylogenies based on organism attributes, the software uses sequence similarity to build a phylogeny of sequences. This is fully automated and based entirely on the information in sequences, and absolutely nothing else.

So therefore your summary…

This summary is not accurate. Function is not used to build these phylogenies. There is no leakage because function is not used to build the phylogeny at all.

A recipe for the similarity algorithm (representing the hypothesis that “proteins are similar because they share the same function”)…

1. Take the unknown sequence and search for the most similar sequences (e.g. using BLAST) in a large database of sequences of known function.
2. Assign the functions of the matched sequences to the unknown sequence.
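The similarity recipe above amounts to "copy the label of the best hit". A minimal sketch, assuming equal-length toy sequences and using fraction of identical positions as a crude stand-in for a BLAST score (the real tool does local alignment and statistical scoring; names and data here are hypothetical):

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_by_similarity(query, database):
    """Return the function of the database sequence most similar to the query.

    database: dict name -> (sequence, function)
    """
    best_name = max(database, key=lambda name: identity(query, database[name][0]))
    return database[best_name][1]


# Hypothetical toy database of sequences with known function.
database = {
    "g1": ("AAAAAAAT", "kinase"),
    "g2": ("AAAAAATT", "kinase"),
    "g3": ("CCCCGGGG", "transporter"),
}

prediction = predict_by_similarity("AAAAAAAA", database)
print(prediction)
```

Note that no history is reconstructed here: the answer depends only on pairwise scores against the query.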

A recipe for the phylogeny algorithm (representing the hypothesis that “protein similarity is caused by shared history, and that shared history is more closely correlated with protein function than similarity itself”)…

1. Take the unknown sequence and search for a very large collection of sequences of both known and unknown function (e.g. using BLAST).
2. Use phylogenetics to systematically reconstruct the evolutionary history of the sequences into a tree.
3. Overlay annotations of known functions on the tree (i.e. the evolutionary history of the sequences), and infer points in the tree where function is gained or lost.
4. Use this inferred history to predict the function of the unknown sequence. (All steps are handled by SIFTER.)
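The phylogeny recipe above can be sketched in toy form. To be clear about the assumptions: the tree is built with simple average-linkage clustering, and the prediction rule is "take the labels in the smallest clade around the query that contains any annotated sequence". Both are simplified stand-ins chosen for brevity, not what SIFTER does. The point of the sketch is only that `build_tree` receives sequences and never sees a function label.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def leaves(tree):
    """Leaf names under a node; a leaf is a str, an internal node a (left, right) tuple."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def build_tree(seqs):
    """Average-linkage clustering using sequences only (no function labels)."""
    clusters = list(seqs)  # start with one cluster per sequence name

    def avg_dist(c1, c2):
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(distance(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg_dist(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append((c1, c2))
    return clusters[0]


def predict_from_tree(tree, query, known):
    """Labels found in the smallest clade that holds the query and an annotated leaf."""
    best = None
    node = tree
    while not isinstance(node, str):
        labels = {known[n] for n in leaves(node) if n in known}
        if labels:
            best = labels
        # Descend into whichever child contains the query.
        node = node[0] if query in leaves(node[0]) else node[1]
    return best


# Hypothetical toy data: annotations are overlaid only after the tree exists.
seqs = {
    "query": "AAAAAAAA",
    "g1": "AAAAAAAT",
    "g2": "AAAAAATT",
    "g3": "CCCCGGGG",
    "g4": "CCCCGGGT",
}
known = {"g1": "kinase", "g2": "kinase", "g3": "transporter", "g4": "transporter"}

tree = build_tree(seqs)                      # step 2: sequences only
prediction = predict_from_tree(tree, "query", known)  # steps 3-4: overlay and predict
print(tree, prediction)
```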

The test of the hypothesis is to see which better predicts function (BLAST alone or SIFTER). This discriminates the two hypotheses, telling us which better explains our world. Some very important points:

First, #2 (constructing the tree) and #3 (inferring when functions are gained and lost in the tree) are separate steps. The tree itself does not use functional annotations to be constructed.
The predictions you get out of the similarity and phylogeny based approaches are different. And the predictions from the phylogeny approach (which uses evolutionary history) are at least 2x better than similarity alone.
So similarity is used to infer history, but function correlates more with history than with similarity. This is a fundamentally important result, and it is strong quantitative evidence for evolution.

If evolution (by which I mean shared ancestry) is false, why would this be?

Swamidass · January 9, 2017, 6:36am

I do have one alternate non-evolution explanation for this.

If God created everything through special creation, in a specific way, we might see this pattern too. If he creates new species by genetically editing prior species, adding and removing functions as he sees fit, that could produce the same pattern. This could be consistent with the Reasons to Believe old earth creationism model.

Though it is important to recognize that, even in this model, the phylogeny is still reconstructing a true history. And we are at the limits of science. Science cannot comment on God’s direct action, and there is really no way of ruling out God creating everything by a mechanism that produces data just as we expect evolution would produce.

Swamidass · January 9, 2017, 6:53am

That is not totally fair.

Some ID proponents accept evolution (as in shared ancestry). So this is not a universal feature of ID.

Moreover, the whole point of creation science and anti-evolution flavors of ID is that there is strong evidence against evolution. Of course, they are wrong here, but they do not actually think there is no evidence behind their case. Rather, they think the evidence behind their case has been suppressed. Once again, they are wrong, but this still is not the right way to explain their position.

Also, even the most staunch atheists agree that life “appears” to be designed; it is “designoid.” From an atheist point of view, they are explaining why this is just a “false appearance” of design, an illusion.

I think the reality is different. What the universe “looks” like is very much defined by culture, preconceptions and knowledge. In science, we study very subtle features of the world that are almost never easily observable. And it is subtle patterns in these features that help us untangle how the world works and its history. None of this, fundamentally, is obvious, common sense, or easy. This is not how we are wired to think at all, and we regularly find surprising non-intuitive things about the universe that go against what it “looks” like to us.

That, in my view, is the beauty of science. It invites us into a new view of the world, one that often sharply differs from what we think is so obviously seen in nature.

gbrooks9 · January 9, 2017, 2:30pm

I think this is a pretty important observation!

Except for the very few known ID supporters who accept the premise that Intelligent Design is tied to a Very Old Earth scenario (which essentially makes them a BioLogos supporter) - -

- - the vast majority of ID supporters have to conclude that God started creating (aka “Poofing”!) thousands of new species every year since the landing of Noah’s Ark, in order to arrive at the current period’s species count in the Millions!!!

gbrooks9 · January 9, 2017, 3:30pm

@Swamidass,

Who can be named in the I.D. “camp” that accepts “Common Descent” in a very old earth context?

benkirk · January 9, 2017, 4:35pm

Swamidass:

[quote]
Moreover, the whole point of creation science and anti-evolution flavors of ID is that there is strong evidence against evolution.
[/quote]
I strongly disagree. If that was their belief, they would delve into the evidence. Instead, they make explicit false claims about evidence, but only hearsay underlies those claims.

[quote]
Of course, they are wrong here, but they do not actually think there is no evidence behind their case.
[/quote]
I disagree. They clearly do not believe there is evidence, which is why they routinely substitute hearsay for evidence when challenged.

[quote]
Rather, they think the evidence behind their case has been suppressed.
[/quote]
If your hypothesis is correct, they would be producing new evidence themselves, not pretending that science is about mere arguments.

[quote=“Swamidass, post:6, topic:26471”]
That, in my view, is the beauty of science. It invites into a new view of the world, one that often sharply differs from what we think is so obviously seen in nature.
[/quote]
That’s something I totally agree with.

Chris_Falter · January 9, 2017, 7:48pm

Per a source I don’t consult often, RationalWiki:

“Denton does not deny common descent which distinguishes him from most intelligent design proponents.”

Michael Denton is a senior fellow at the Discovery Institute’s Center for Science and Culture.

Hugh Ross doesn’t accept common descent with respect to Homo sapiens sapiens, and he would generally be described as an ID proponent. But he argues vehemently that the earth is 4.6B years old and the universe is 13.8B years old. He has publicly debated Jason Lisle of Answers in Genesis on the age question many times.

Cheers,

Chris_Falter · January 9, 2017, 8:19pm

Hi Joshua -

I am hoping to pull the thread back to your original post, which still fascinates me. Based on your latest post, here’s my understanding of the process:

1. Select a sequence/species and a comparison set of sequences from roughly similar species.
2. Project the species under study into a relationship with the other species.
   a. One method is a set of distance scores, based on pairwise similarity as measured by BLAST.
   b. The other method is a position in a phylogenetic hierarchy, based on SIFTER analysis of the sequences.
3. Assign function labels to the other sequences.
4. Use the projected relationship + function labels to predict the function of the sequence under study.

The goal is to determine which method of projecting relationships yields better predictions: straightforward genetic similarity (method a) vs. phylogeny (method b).

The results show that phylogeny yields far more accurate predictions. Thus we infer that phylogeny is a better description than simple genetic similarity of the relationship between the species.

Aside from the fact that some of the steps are grossly oversimplified – in particular the final step – have I accurately depicted the research?

Critics of evolution often state that genetic similarity has no bearing on the question of biological origins and common descent because we expect similar functionality to be supported by similar genetic sequences. The implication of the research, however, is that phylogeny is more important than pure genetic similarity in predicting function; therefore we can confidently accept the reality of a phylogeny (i.e., common descent).

Again, thanks! And please feel free to clarify anything I might have misunderstood or explained poorly.

Godspeed,

Swamidass · January 9, 2017, 11:11pm

Thanks. It really is a fascinating study.

So your summary is incorrect on a couple of points.

First, the algorithms we are discussing operate on genes, not species. They trace the phylogeny inferred from the sequences of these genes. Therefore duplication events give rise to different leaves in the tree, and most species map to multiple leaves in the tree. The phylogeny algorithm can make inferences about exactly when in the tree the duplications occur.

Second, the analysis does not pick sequences based on the species. Instead, it starts by finding all similar sequences, including those that are only remotely similar, that have been observed in nature. It is a 100% sequence based approach. As long as there is enough sequence similarity, the gene is included in the analysis.

Third, your language about “projecting” relationships is a reasonable analogy. It matches closely what is happening in the similarity case, but it isn’t exactly what is happening in the phylogeny case. Still, your schematic is about right.

In the phylogeny case, we try and reconstruct the history of the genes. We first do this by using the genetic similarity to give us the ancestry relationships between genes. Next, we infer the points (with careful attention to uncertainty) in the tree where specific functions are gained and lost. So we now have an inferred history of all the genes in the phylogeny and the gain and loss of function in these genes. Of course, this model makes predictions about the function of genes that are not annotated.
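To make "infer the points in the tree where functions are gained and lost" concrete, here is a minimal sketch using Fitch parsimony on a hand-written toy tree. Parsimony is swapped in as a simpler, non-probabilistic stand-in; it ignores the uncertainty mentioned above and is not the method the software itself uses. The tree, gene names, and labels are hypothetical.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass on a binary tree.

    tree:   a leaf name (str) or a (left, right) tuple
    states: dict leaf name -> observed function
    Returns (candidate states at this node, minimum number of function changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no change needed at this node.
        return common, left_cost + right_cost
    # Children disagree: a gain or loss must have happened on one branch.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical gene tree: ((a, b), c) on one side, (d, e) on the other.
tree = ((("a", "b"), "c"), ("d", "e"))

# Function follows the tree cleanly: one change explains everything.
clean = {"a": "kinase", "b": "kinase", "c": "kinase",
         "d": "phosphatase", "e": "phosphatase"}
root_clean, changes_clean = fitch(tree, clean)

# Same tree, but gene b has switched function: an extra change is required.
scattered = {"a": "kinase", "b": "phosphatase", "c": "kinase",
             "d": "phosphatase", "e": "phosphatase"}
root_scattered, changes_scattered = fitch(tree, scattered)

print(changes_clean, changes_scattered)
```

The count of required changes is a rough measure of how well function tracks the inferred history: few changes means the tree explains the annotations economically.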

The key point is that the reconstruction of history does change the predictions we make about the function of unknown sequences. And the predictions improve dramatically. If the reconstructed history were just a false reality, this would not make sense. Of course, there is still real uncertainty in the history, and even errors, but there seem to be enough correct inferences in it to improve predictions of function dramatically.

As you put it…

That is correct.

Exactly.

Well I would be more circumspect.

I would say: this is strong evidence for the “similarity caused by shared history” hypothesis over the “similarity caused by common function” hypothesis, whether or not evolution is ultimately true.

Moreover, this directly tests the adequacy of a design principle in explaining biology. It is absolutely true that proteins/machines/etc. that have the same function often show similarity to one another. It is also true that very similar things have different functions, and very different things have similar functions. In the end, we need to see how well this principle explains the data over alternatives. We find that phylogeny really does systematically improve our predictions of function, over that of just using the “common-function causes similarity” design principle.

I think this is an important body of work for those that dismiss evolutionary theory as useless, or assert that similarity data captures everything we get from phylogenies. Something real and useful is being inferred by phylogenetics.

Thanks for the questions, and I hope that clarifies things!

Chris_Falter · January 9, 2017, 11:16pm

Two thumbs up, Joshua! Thanks for patiently answering the questions, and correcting the mistakes, of this tyro.

Jonathan_Burke · January 10, 2017, 12:33am

Even the tiny number of ID proponents who accept shared ancestry still reject evolution as an explanation for the diversity of species. If ID proponents really accepted evolution, they would be Evolutionary Creationists, but they take great care to differentiate themselves from that position, precisely because they do not accept evolution. Not only that, but they repeatedly argue that evolution is simply not a valid explanation for the diversity of species.

They say they have no problem with natural selection, but don’t believe it can lead to evolution. They say they have no problem with genetic mutation and genetic drift, but don’t believe they can lead to evolution. They say they have no problem with the fossil record, but don’t believe it provides evidence for evolution. They say they have no problem with common descent, but keep raising arguments against it and arguing that it isn’t evidence for evolution. They say they don’t believe evolution is necessarily incompatible with the Bible, but keep raising arguments as to why it’s incompatible. All this is precisely why IDers not only oppose evolution but also oppose evolutionary creationism.

Actually I agree with Ben Kirk on this point. I think he identified the issues very well.

This needs to be filled out a bit more. Even the most staunch atheist agrees that some forms of life appear to be designed. However even the most staunch creationist finds it hard to explain why many forms of life do not appear to be designed; fish with eyes which don’t work, insects with inactive wings trapped beneath a hard carapace, and the hideously dangerous and painful birth process of the spotted hyena (resulting in 60% of cubs being stillborn, and a high death rate among first time mothers), to name just a few.

This is why atheists don’t go around saying life has a “false appearance of design”. Rather, they explain (as you did), that life is generally so well adapted that we interpret some forms as having an appearance of design, because of our personal perspective (very much as you described). The problem for many Christians on the other hand is that since they rightly believe God created the universe then they have to explain why God did so in such a way as to give it a false appearance of age, or the false appearance of evolution. This is not a problem for the atheist.

Christy · January 10, 2017, 1:48am

@benkirk
@Jonathan_Burke
There really is no need to turn every thread into a “why ID/YEC folks are the worst” tangential tirade. Please try to stay on topic.

Jonathan_Burke · January 10, 2017, 2:02am

I don’t believe either of our posts could be described in that way. The whole point of this thread is which explanation best describes the evidence. Joshua has explained why evolution explains the evidence better than ID and YEC. Ben and I agree, and have provided additional commentary on the point.

Christy · January 10, 2017, 2:21am

Your “additional commentary” needs some new material.

benkirk · January 10, 2017, 3:02pm

[quote=“Christy, post:15, topic:26471”]
There really is no need to turn every thread into a “why ID/YEC folks are the worst” tangential tirade.
[/quote]
I don’t see that either of us claimed that “ID/YEC folks are the worst.” We are simply disagreeing with Swamidass’s description of positions regarding evidence.

[quote]
Please try to stay on topic.
[/quote]
I’m not following you. We were responding directly to Swamidass. Were his comments off topic? His OP was about analyzing evidence.

[quote=“Christy, post:17, topic:26471”]
Your “additional commentary” needs some new material.
[/quote]
Are you saying that ID people actually do claim that evidence is being suppressed? I’ve certainly never seen anything of the sort.

This point is incredibly important. No one is claiming that similarity simply demonstrates evolution, while the denialists’ straw men rarely go beyond the word “similarity,” used as vaguely as possible. The details of the evidence are where it’s at.

Christy · January 10, 2017, 8:30pm

I understand that. It was a defensive, preemptive maneuver on my part, because I could envision where things were headed. It seems to me that we already have three or so open threads with plenty of “ID is not real science” themed content contributed by the two of you and more of the same on this thread would be an unnecessary bunny trail. We are all aware of your views on the topic of how ID folks handle evidence.

Chris_Falter · January 10, 2017, 10:47pm

Measured by the “heat index” (number of posts), those threads have been by far the most popular. They may not have fared quite so well as measured by the “light index” (understanding advanced), however.