Why I remain a Darwin Skeptic

EricMH · August 5, 2020, 8:51pm

I asked him about sequence data, and he hasn’t figured out how to apply his technique at the sequence level yet.

BLASTing a gene sounds like an easy check with the JSON files.

EricMH · August 5, 2020, 9:34pm

Alright, now I think my script is ready to demo. Experiment with it yourself at the following link.

https://repl.it/talk/share/Phylogenetic-signal-experiment/48483

An explanation of the script:

This script generates three different synthetic datasets, and checks each for phylogenetic signal.

The test is the following:

1. Infer a tree model from the dataset and calculate its probability score.
2. Calculate the probability score of a star model for the same dataset.
3. If the tree model has a greater probability than the star model, then the dataset has phylogenetic signal.

The three datasets are generated as follows:

1. Tree dataset: create a tree graph and collect its leaves as the dataset.
2. Fake dataset: create fake data by randomly generating leaves for the dataset.
3. DAG dataset: create a DAG and collect its leaves as the dataset.

I then ran the phylogenetic signal test described above on each dataset and recorded how often the hypothesis that the dataset has phylogenetic signal was falsified.

Out of 20 experiments, here are the results:
Tree dataset: phylogenetic signal hypothesis falsified 1 / 20 times.
Fake dataset: phylogenetic signal hypothesis falsified 0 / 20 times.
DAG dataset: phylogenetic signal hypothesis falsified 0 / 20 times.

So the datasets generated from the DAG and from fake leaves both passed the phylogenetic signal test slightly more often (20 / 20) than the dataset generated from a tree (19 / 20).

This is why extracting a tree from a dataset is insufficient evidence to say the dataset was generated by a tree.

T_aquaticus · August 6, 2020, 2:59pm

Why not use real data sets?

EricMH · August 6, 2020, 3:37pm

I’d just be reproducing Ewert’s paper. That isn’t the point, however. The point is to show phylogenetic signal is meaningless as evidence for evolution, since DAGs and random data generate greater phylogenetic signal than trees.

T_aquaticus · August 6, 2020, 3:45pm

Not if you used sequence data.

It is evolutionary mechanisms that create phylogenetic signal, not trees. If I have the correct understanding, your data needs to have been produced by a Markov chain-like process where you have vertical inheritance and random changes within those lineages. Is that how your data was produced?

EricMH · August 6, 2020, 4:07pm

This is a simplistic gene evolution model where each edge in the graph adds a new gene as well as passing all the genes from the source node to the destination node for that edge. So, for instance, if edges A and B both terminate at node C, and the source nodes for A and B each have 4 genes, then C will have 4+4+1=9 genes. Genes are stored in unordered sets, and two gene sets are compared using the Jaccard distance.
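The gene-set model and the Jaccard comparison described above can be sketched in a few lines of Python. This is only a minimal illustration, not the script linked earlier in the thread. The names `jaccard_distance`, `propagate_genes`, and the small example graph are hypothetical. It also assumes that each node gains exactly one new gene on top of the union of its parents' genes, which is the reading that reproduces the 4+4+1=9 example; the description could also be read as one new gene per edge.

```python
from itertools import count


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def propagate_genes(parents, order):
    """Assign a gene set to every node of a DAG.

    parents: dict mapping each node to a list of its parent nodes.
    order:   the nodes listed in topological order (parents before children).

    Assumption: a node inherits the union of its parents' genes and gains
    exactly one brand-new gene of its own.
    """
    fresh = count()  # source of unique gene identifiers
    genes = {}
    for node in order:
        inherited = set()
        for parent in parents.get(node, []):
            inherited |= genes[parent]
        genes[node] = inherited | {next(fresh)}
    return genes


# Hypothetical graph: two independent 4-gene lineages merging at node "c".
parents = {
    "a1": [], "a2": ["a1"], "a3": ["a2"], "a4": ["a3"],
    "b1": [], "b2": ["b1"], "b3": ["b2"], "b4": ["b3"],
    "c": ["a4", "b4"],
}
order = ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4", "c"]
genes = propagate_genes(parents, order)

print(len(genes["a4"]), len(genes["b4"]), len(genes["c"]))
print(jaccard_distance(genes["a4"], genes["c"]))
print(jaccard_distance(genes["a4"], genes["b4"]))
```

Here `a4` and `b4` each hold 4 genes and share none, so their Jaccard distance is 1, while `c` holds all 8 inherited genes plus one new gene. Because a node with two parents is only possible in a DAG, this merge step is what separates the DAG dataset from the tree dataset in this sketch.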

Sequence-level simulation is beyond me at this point, but I believe this gene-based simulation is a good demonstration of my critique.

T_aquaticus · August 6, 2020, 4:12pm

Not a coder, so these are honest questions.

So are you using gene counts, or are you using more of a matrix of which specific genes are present in each organism?

EricMH · August 6, 2020, 4:25pm

It is the latter.

I will also be creating visualizations for all this so it is more understandable, and you can more easily see if my critique has merit.
