Fallacy of the Phylogenetic Signal? Part 2

EricMH · August 19, 2020, 11:48pm

In my last installment, I explained how to test the phylogenetic signal in a made up dataset, in an admittedly simplistic manner.

The phylogenetic signal is a statistical test to determine whether a proposed evolutionary tree is a good fit for the data. The basic idea is, can we build a tree from the data where the traits seem to follow the tree structure, vs just pop up randomly in various locations in the tree.

Why does all that matter? It has been said on this forum and elsewhere that one of the best evidences for evolution is how the genetic data forms ‘perfectly nested clades’, or in less technical terms, the genetic data forms a tree where the genes only travel down branches and not sideways to other branches. This is what we would expect if species evolved. If we were able to retrieve the genetic record looking backwards in time, we would expect to see genes to be passed on from ancestor to children in an unbroken chain, whereas genes would not be passed from one lineage to another independent lineage. Since we cannot travel back in time to do this, and can only work with present species, we should expect to be able to derive a tree from the genetic data that exhibits this unbroken chain of gene passing without lateral transfer.

Now, things are not quite that simple. There is evidence of lateral gene transfer, and there is also the issue that genes can disappear from a lineage. These two factors alone mean we do not expect a perfect tree. There will be some measure of deviation. But still, we should expect at least a ‘pretty good nested clade’, whereby the derived tree should still be statistically distinguishable from a tree with randomly distributed genes, and there should be only one way to create an optimal tree from the data, or at least we shouldn’t be able to create a number of dramatically different trees that have the same goodness of fit to our dataset. There are other issues, too, such as the creation of an optimal tree is a very difficult problem, and cannot be done except on very small datasets. We will ignore all that for now, and just focus on the building of the trees from the dataset, and measuring whether they are pretty good, and making sure we can’t build significantly different, but equally good, trees.

My contention is that these criteria, that we can build trees with non random genes, and we cannot build divergent trees, is not uinque to datasets that are generated by evolution. On the contrary, we can have datasets generated according to other models which can achieve equal, or perhaps even better, fit for these criteria than datasets created by evolution. In other words, we can generate non-evolutionary datasets from which we can derive a tree, which exhibits a strongly non-random character.

Last section introduced my basic methodology. The methodology is indeed basic, and does not really encompass my goal above. This is partly because I don’t understand the phylogenetic signal methodology used by scientists well enough to reproduce it myself, and also I want to start with a fairly easy to understand model so myself and interlocutors can all be on the same page. Thus, what I am about to show is quite limited, and does not conclusively substantiate my conjecture. Nevertheless, it does illustrate the path I am taking, and allow others to follow along. So, let us proceed.

Last time around I created a dataset through a simple evolution model, which constructs a tree graph, and then adds a new gene for each node in a lineage. Genes are never lost. I then take the leaves of the tree, and use a simple clustering algorithm, derive a tree from the dataset. I then compare the tree to a star model. The star model takes the genes common across all the leaves as the origin node, and the leaves are the spokes. The tree and star are compared for the dataset using a metric called sum of delta scores (SDS). The basic idea about the SDS is to score how much divergence there is between the genes of parent and child, across all the edges in the tree. A large score indicates there is a large divergence, and a small score a small divergence. A smaller score thus indicates a more concise description of the data. There are some concerns about whether this is a sufficient statistic, but it does at least capture some aspect of genes following a lineage, albeit not perfectly. The key point regarding SDS is that comparing two models for a given set of leaves, the model that describes the data more concisely will receive a lower SDS score. If the leaves can be decomposed into a tree structure, then the tree structure will have a lower SDS score than the SDS score for the leaves by themselves.

This time around, I will introduce two new models. The first model is the directed acyclic graph (DAG), and the second model is randomly generated equal length leaves. Both methods of generating datasets result in datasets that fit the tree much better than the star, and do this better than the dataset generated by evolution. As for why this happens, it will become obvious as we look at the graph visualizations. The reason is the star can only fit the dataset better than the tree when all the leaves have genes in common, and this condition will almost never happen with the DAG and random methods of generating leaves, but will always happen when leaves come from an evolutionary generation.

Now, enough with all the buildup, let’s see these alternate methods. First, we will look at the randomly generate leaves. The star graph for the leaves and the leaves themselves are the same, since randomly generated leaves almost never all share a collection of genes in common. As in the last installment, the genes are colored boxes with numbers, and two boxes of the same color and number are the same gene.

Now for the tree that is derived from this dataset. Since two leaves can share some of the same genes in the dataset, it is almost always possible to derive at least a partial tree from the dataset. I say partial, because you will see the nodes closer to the root tend to be empty, as we will reach incompatible nodes that do not share any genes. However, in a tree, all nodes except the root must have a parent, so even incompatible nodes will have a parent, which will necessarily be empty.

You will need to click on the graphic to see the full tree, and you will need to download the graphic to see all the detail. Again, this in the tree derived from the same dataset displayed above.

So, that is the randomly generated dataset and star and tree derived therefrom. As we discussed regarding the SDS metric, if a dataset can be decomposed into a tree at all, it will always score better (lower) than the dataset itself. Since the random dataset never shares common genes, the star root will always be empty, and the star model will always be equivalent to the dataset itself. On the other hand, there will always be leaves that share genes in the random dataset, and so the dataset will always be decomposable into a partial tree. This means that the randomly generated dataset will always fit a tree much better than a star according to the SDS metric. On the other hand, the evolved dataset must always share at least one gene amongst the leaves, if not more, so there is always the possibility the evolved dataset can fit a star model better than a tree according to the SDS metric. As a consequence, the randomly generated dataset has a better phylogenetic signal (at least in this formulation of the phylogenetic signal) than the evolved dataset.

Finally, let’s look at a DAG generated dataset. In this case, we will look at three different graphs. The first graph will be the actual DAG that created the dataset. The second graph will be the star derived from the DAG dataset. The final graph will be the tree derived from the DAG dataset. We will see the same situation applies, where a DAG dataset will almost never share common genes, and thus will always fit a tree better than a star, and consequently has a better phylogenetic signal than the evolved dataset.

Now, here is the original DAG for the dataset we will be examining. Before looking at the star, think about the characteristics of the DAG, and whether we expect the star to ever have genes in the root, i.e. whether the leaves will ever share a common set of genes.

Click to view the full graph, and download to see the full detail.

Next is the star for the same dataset. Note the reason why a DAG is unlikely to produce a star with a filled in root. It is because the DAG can have multiple roots, and if there is ever more than one disconnected root, there will always be two leaves that do not share any genes. Since the DAG is randomly generated, and for the parameters chosen, there are more DAGs with multiple disconnected roots instead of one root, then DAG datasets will almost always generate stars with empty roots.

Finally, let’s look at the tree graph for the DAG dataset.

Click to see the full tree. Download to see the full detail.

Since the DAG is likely to contain subgraphs that have some element of tree structure, thus it is likely there will be at least two leaves that share genes, and thus it is likely the DAG dataset can be decomposed into a partial tree. As discussed, any dataset that can decompose into a partial tree will receive a better SDS score than the dataset itself. Therefore, the DAG dataset is going to have a better phylogenetic signal than the evolved dataset.

This concludes the second installment of alternative models.

What can we learn from this investigation? Of course, it is a very simplistic take on the whole question of phylogenetic signal, and does not compare in the slightest to the mathematical rigor in the actual science of cladistics. This investigation serves as a thought experiment, to show there is a possible world, with some assumptions and methods that match the actual world of cladistics to some extent, in which evolved datasets actually possess a lower phylogenetic signal than non evolved datasets. So, in this possible world, the fact a dataset exhibits a phylogenetic signal is not evidence the dataset evolved. On the contrary, exhibiting a strong phylogenetic signal actually indicates the dataset did not evolve.

What does this investigation say about the real world of cladistic science? It of course does not call into question the entire enterprise. However, the investigation at least serves as a qualifier. It shows that we need to say a bit more about the dataset and its phylogenetic signal in order to infer the dataset evolved. The mere fact a dataset exhibits a phylogenetic signal according to some definition does not in itself indicate the dataset was generated by evolution.

EricMH · August 20, 2020, 12:57am

As an addendum, here is an example of an evolved dataset that has a lower SDS score for the star graph than the derived tree, which means the dataset does not exhibit a phylogenetic signal (within my formulation of phylogenetic signal). You can see the reason for the null result is due to the fact that many genes are common among the leaves, and this common set forms the majority of some of the leaves.

Original evolution tree.

Derived star graph.

Derived tree graph.

Klax · August 20, 2020, 8:04am

And again.

EricMH · August 21, 2020, 3:43pm

If anyone has any questions about what exactly I am arguing here, or is unclear about the analysis I put forth, or why does any of this even matter to the Biologos forum, I am happy to answer/clarify anything. Please ask away

At the very least, check out all the pretty colors in the graphs. They are somewhat eye catching, if I do say so myself

Dale · August 22, 2020, 1:09am

Yes, I noticed your graphics. Nicely done!

EricMH · August 25, 2020, 11:40am

So now that I’ve debunked the strongest evidence in favor of evolution have you all converted to creationism?

T_aquaticus · August 25, 2020, 2:44pm

I would be more interested in seeing results from standard phylogenetic algorithms.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.721.2540&rep=rep1&type=pdf

There are online tools you could use:

https://molbiol-tools.ca/Phylogeny.htm

EricMH · August 25, 2020, 4:03pm

I don’t think that’s what we need. Those are sequence based phylogeny tree constructors. But phylogenetic signal deals with how a trait, such as a gene, is distributed in a tree. So, these tools are not relevant to the question.

I could adjust my code to generate sequences, and have these tools derive trees. But that won’t tell us anything helpful regarding the phylogenetic signal, and veracity thereof.

We need some software that says whether a dataset contains a phylogenetic signal.

EastwoodDC · August 25, 2020, 4:08pm

/sub
will comment later.

EricMH · August 25, 2020, 7:22pm

yes, the inability of any forum member to refute my work shows evolution is completely dead and creationism is victorious

EastwoodDC · August 25, 2020, 9:36pm

Eric, I do not think this is at all clear. I read it three times and I can’t even identify useful questions to ask.

This would be a good place to start though. Show us some statistics!

PS: I agree with Dale the graphics are spiffy. Details on the methodology are lacking.

T_aquaticus · August 25, 2020, 10:01pm

I disagree. Until you use some of the gold standard methods for measuring phylogenetic signal your conclusions really won’t matter.

EricMH · August 25, 2020, 10:34pm

I do agree I need to use a gold standard. But the links you sent me do not seem to have any such thing. All they list, unless I’m missing something, is tree reconstruction algorithms, and as you mentioned elsewhere, we can build a tree out of any dataset.

I need some sort of gold standard tool that tells me whether my dataset has a phylogenetic signal.

Otherwise, I’ll have to reverse engineer one of these papers, and then you’d have to be able to confirm my work, and we’d be back to square one.

EricMH · August 25, 2020, 10:36pm

Sure, that’s easy to do. But in your three passes, did you not notice the informal proof that DAGs and random datasets must necessarily have a better phylogenetic signal (according to SDS and compared to a star) than an evolved dataset?

I can provides a stats this weekend, but I had thought they are redundant in lieu of the informal proof.

EastwoodDC · August 25, 2020, 10:41pm

No, and I still don’t. Sorry.

EricMH · August 25, 2020, 10:53pm

Alright, that is my failure. I will try to articulate the proof better in the near future.

EastwoodDC · August 25, 2020, 11:06pm

So here is the thing - you are making a pile of assumptions, which I would first have to identify to try to understand what you are doing so I could explain what I think is wrong. I’m pretty sure something is wrong here, but it’s a lot of work to figure that out. More importantly it’s not my job; you need to do a better job of defining this so that others can understand.

Some things I don’t see:
A clearly stated null hypothesis.
The distribution of SDS under the null hypothesis.
A sampling scheme (part of the null, which leaves do you “see”)
Clearly defined alternate hypotheses (and alternative distributions of SDS)

I’m pretty sure the decision rule “lower SDS score” will be broken if the trees have different numbers of nodes, or if sampling if unequal.
I think you have unstated assumptions about the size of trees and how they are generated that effect the distribution of SDS.

Suggestion: start with a very simple tree, 3 leaves

PPS: It wouldn’t surprise me if SDS has an asymptotic normal distribution. Even calculating the expectation under the null would be a good start.

EastwoodDC · August 25, 2020, 11:07pm

Eric is trying something. Play nice.

EricMH · August 26, 2020, 2:37am

this could be a potential problem

both tree and DAG have exactly the same set of nodes, but even so it is har for me to say to what degree the graphs are equivalent. as you say, if i perhaps increase the DAG nodes and decrease the tree nodes, or visa versa, the tree will suddenly start showing a better signal than the DAG

the purpose of the informal proof was to address this sort of concern, pointing out the difference of signalling is a property of the graph type, not the parameterization. so a large parameter sweep should produce the same results, regardless of respective node count of the DAG vs tree

i wii perform such a sweep to illustrate the proof, along with the other stats you reqest

Klax · August 26, 2020, 11:03am

Your ‘work’ doesn’t need refuting. It doesn’t qualify. It is dialectically inadequate, as Dan points out. Evolution is a fact. You need to refute that with a superior antithesis which makes the phylogenetic signal meaningless. You cannot.