I asked him about sequence data, and he hasn’t figured out how to apply his technique at the sequence level yet.
BLASTing a gene sounds like an easy check with the JSON files.
Alright, now I think my script is ready to demo. Experiment with it yourself at the following link.
https://repl.it/talk/share/Phylogenetic-signal-experiment/48483
An explanation of the script:
This script generates three different synthetic datasets, and checks each for phylogenetic signal.
The test is the following:
The three datasets are generated as follows:
I then ran the phylogenetic signal test explained above on each dataset, and recorded how frequently the hypothesis that the dataset has a phylogenetic signal was violated.
Out of 20 experiments, here are the results:
Tree dataset: phylogenetic signal hypothesis falsified 1 / 20 times.
Fake dataset: phylogenetic signal hypothesis falsified 0 / 20 times.
DAG dataset: phylogenetic signal hypothesis falsified 0 / 20 times.
So, both the datasets generated from the DAG and from fake leaves have a slightly better phylogenetic signal than the dataset generated from a tree.
This is why extracting a tree from a dataset is insufficient evidence to say the dataset was generated by a tree.
Why not use real data sets?
I’d just be reproducing Ewert’s paper. That isn’t the point, however. The point is to show phylogenetic signal is meaningless as evidence for evolution, since DAGs and random data generate greater phylogenetic signal than trees.
Not if you used sequence data.
It is evolutionary mechanisms that create phylogenetic signal, not trees. If I have the correct understanding, your data needs to have been produced by a Markov chain-like process where you have vertical inheritance and random changes within those lineages. Is that how your data was produced?
This is a simplistic gene evolution model in which each edge in the graph adds a new gene and also passes all the genes from the edge's source node to its destination node. So, for instance, if edge A and edge B both terminate at node C, and the source nodes for A and B each have 4 genes, then C will have 4+4+1=9 genes. Genes are stored in unordered sets, and two gene sets are compared with the Jaccard distance.
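To make the model concrete, here is a minimal Python sketch of the gene-set inheritance and the Jaccard comparison. It is not the code from the linked script. The names `child_genes`, `jaccard_distance`, and the `_new_gene` counter are made up for illustration, and it assumes one new gene per child node and parents that share no genes, which is the reading that reproduces the 4+4+1=9 count above.

```python
from itertools import count

# Hypothetical source of fresh gene ids; every call to next() yields a new gene.
_new_gene = count()


def child_genes(*parent_gene_sets):
    """Return the child's gene set: all parents' genes plus one new gene.

    Assumption: one new gene is added per child node, which matches the
    4+4+1=9 example when the two parents share no genes.
    """
    genes = set().union(*parent_gene_sets)
    genes.add(next(_new_gene))
    return genes


def jaccard_distance(a, b):
    """Jaccard distance between two gene sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Worked example: two source nodes with 4 distinct genes each feed node C.
a_src = {next(_new_gene) for _ in range(4)}
b_src = {next(_new_gene) for _ in range(4)}
c = child_genes(a_src, b_src)

print(len(c))                      # 9
print(jaccard_distance(a_src, c))  # 1 - 4/9, since C holds all 4 of A's genes
print(jaccard_distance(a_src, b_src))  # 1.0, the two sources share nothing
```

In a tree every node has a single parent, so `child_genes` would be called with one set; in a DAG a node can have several parents, as in the example with node C.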
Sequence-level simulation is beyond me at this point, but I believe this gene-based simulation is a good demonstration of my critique.
Not a coder, so these are honest questions.
So are you using gene counts, or are you using more of a matrix of which specific genes are present in each organism?
It is the latter.
I will also be creating visualizations for all this so it is more understandable, and you can more easily see if my critique has merit.