I did not realize that. The difference cannot be too great, though; otherwise attempts to rebuild phylogenetic trees are equivalent to the DAG scenario I explained above, implying the data is forced into a tree. Seems like the horns of a dilemma.
I might not know what I am talking about. Maybe differing alleles are considered the same gene.
That’s a whole new can of worms. We would then have to ask why so many ID proponents claim that most of the human genome is functional when the vast majority of that genome is accumulating mutations at a rate consistent with neutral drift. We also have to factor in contingency and dependence, where the loss of the previous function has a more severe consequence than any new function the gene acquires.
We also have to factor in time. A good analogy is language. If you go back in time far enough you won’t be able to understand our English speaking ancestors. However, each and every generation between that ancestor and you were able to communicate with each other. Genes can also take on so many mutations that we can no longer recognize them as having come from a specific shared ancestor. At best, you would need a lot of coverage across a phylogeny to even have a chance of reconstructing ancestral sequences, or recognizing deeply shared ancestry between genes.
Gene A recombines with Gene B to produce AB. AB recombines with C. So on and so forth.
Gene A duplicates into gene A and gene B. Gene B accumulates a lot of mutations since it is no longer under selection to fill the function supplied with A.
Like any method of measurement, it has its limits. Over time you will start to get mutations that happen multiple times at the same base, and without enough information from multiple species this will be counted as a single mutation. In other words, mutations start to saturate the sequence so that you are no longer counting single events.
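To put rough numbers on saturation, here is a minimal sketch using the Jukes-Cantor correction. This is one simple textbook model that I am choosing for illustration; it assumes all substitutions are equally likely, which real sequences violate. The function names are my own.

```python
import math

def jukes_cantor_distance(p):
    """Estimate substitutions per site from the observed fraction p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); at 0.75 the sequences are saturated")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

def expected_observed_fraction(d):
    """Inverse relation: fraction of sites expected to differ after d substitutions per site."""
    return 0.75 * (1 - math.exp(-(4.0 / 3.0) * d))

# As the true distance grows, the observed fraction of differing sites
# flattens out below 0.75, so raw counts increasingly undercount events.
for d in (0.05, 0.5, 1.0, 2.0, 5.0):
    p = expected_observed_fraction(d)
    print(f"true d = {d:4.2f}  observed p = {p:.3f}  corrected d = {jukes_cantor_distance(p):.2f}")
```

At small distances the observed and true values nearly agree. At large distances the observed fraction barely moves while the true number of events keeps climbing, and a small measurement error in p turns into a large error in the corrected distance.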
Biology happens. It is vastly complex and ever changing. One of the fundamentals of the model is that lineages are independent of one another. A mutation that happens in one lineage cannot jump its way across the tree and land somewhere else, at least in the case of eukaryotes. Prokaryotes will swap DNA with whatever other bacterium happens to walk by.
Eukaryotes also have complexities of their own. Mitochondrial DNA mutates at a higher rate than autosomal DNA, and mtDNA can be transferred to the nuclear genome, producing insertions called numts. So you would have to consider the different rates of evolution in mitochondrial and nuclear DNA, when the numt insertion occurred, and so forth. There are other examples of this same dynamic, such as different rates of evolution in exogenous retroviruses and endogenous retroviruses.
So if you are asking for a simple model . . . well, there isn’t one.
Deleted b/c I thought of some flaws in the simulation.
I’m not sure a biologist would agree. The single-celled E. coli has about 3000 genes, for example, which is about 12% of the human total. If the common ancestor of multi-celled animals 1 bya had that many genes, then we would have no ability to reconstruct a rooted tree with thousands of today’s genes serving as leaves.
That said, it would be useful to get some input from a biologist on this question. @glipsnort, @DennisVenema: do you have any insights to share?
Peace,
Chris
Googling "phylogenetic signal", it says organisms from the same branch resemble each other more than randomly selected organisms.
Is there more to it than this? How do scientists guard against the phylogenetic signal just being a function of the tree inference algorithm? E.g., if we incrementally build a tree by joining the closest organisms and clusters, then we will get a phylogenetic signal even if we start with randomly generated DNA sequences.
I’m not an expert on phylogenetics, but from what I understand it is a question of how well the data fit a tree. You can force any data set into many different trees, but the real test is if those trees are statistically significant. If there are many divergent trees that all have the same likelihood then you don’t have a phylogenetic signal. If you have a handful of very similar trees that have the same likelihood then you can have a significant phylogenetic signal.
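One way to see the difference between "a tree can be built" and "the data are tree-like" is a small simulation. The sketch below does not use the likelihood comparison described above; it swaps in a simpler measure, the four-point delta score, which is near 0 when pairwise distances fit a tree and larger when they do not. The tree shape, mutation rate, sequence length, and function names are all assumptions chosen for illustration.

```python
import itertools
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Copy seq, replacing each site with a different base with probability `rate`."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < rate else base
        for base in seq
    )

def evolve_on_tree(root, depth, rate, rng):
    """Evolve `root` down a balanced binary tree and return the 2**depth leaf sequences."""
    leaves = [root]
    for _ in range(depth):
        # Each current sequence leaves two independently mutated descendants.
        leaves = [mutate(seq, rate, rng) for seq in leaves for _ in range(2)]
    return leaves

def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_delta(seqs):
    """Mean four-point delta score over all quartets of sequences."""
    scores = []
    for w, x, y, z in itertools.combinations(seqs, 4):
        sums = sorted([
            hamming(w, x) + hamming(y, z),
            hamming(w, y) + hamming(x, z),
            hamming(w, z) + hamming(x, y),
        ])
        spread = sums[2] - sums[0]
        # On a perfect tree the two largest sums are equal, giving a score of 0.
        scores.append((sums[2] - sums[1]) / spread if spread else 0.0)
    return sum(scores) / len(scores)

rng = random.Random(1)
length = 400
root = "".join(rng.choice(BASES) for _ in range(length))
tree_seqs = evolve_on_tree(root, depth=3, rate=0.05, rng=rng)
random_seqs = ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(8)]

tree_delta = mean_delta(tree_seqs)
random_delta = mean_delta(random_seqs)
print(f"mean delta, evolved on a tree: {tree_delta:.2f}")
print(f"mean delta, random sequences:  {random_delta:.2f}")
```

A clustering algorithm will happily return a tree for either data set, but the score is computed from the distances alone, before any tree is built, so it does not inherit the algorithm's bias.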
You should check out the wiki entry on computational phylogenetics. Some of it may be right up your alley.
I’ve looked through it, and it seems very difficult to definitively say a phylogeny is correct. There are many caveats, and it sounds like a lot of data is discarded if it is too ‘noisy’ for the tree model. That still makes me wonder how we know the tree model is correct, and whether we are just fitting the data to the model. That is the one aspect which I do not see covered, even in the referenced paper on how to measure phylogenetic signal. It is the big unanswered question, and it is why it seems to me the theory of evolution is not approached in a scientific manner, i.e. the fundamental assumption that life forms a treeish graph is never questioned. Science is about questioning assumptions, not trying to figure out how to make the data fit an assumption.
This may be a good place to start:
Most statistical tests boil down to the question of whether a random set of data would produce the observed data. It goes clear back to the early days of statistics when Fisher calculated the odds of someone randomly guessing which cups of tea were milk first or tea first.
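For anyone who wants the arithmetic, here is a small sketch of the tea calculation. It assumes the classic setup of eight cups, four poured milk first and four tea first, with the taster told to pick out the four milk-first cups. The function name is my own.

```python
from math import comb

def p_at_least_k_correct(k, cups_per_kind=4):
    """Chance of picking at least k of the milk-first cups correctly by pure guessing."""
    total = comb(2 * cups_per_kind, cups_per_kind)  # ways to choose 4 cups out of 8
    favorable = sum(
        comb(cups_per_kind, i) * comb(cups_per_kind, cups_per_kind - i)
        for i in range(k, cups_per_kind + 1)
    )
    return favorable / total

print(p_at_least_k_correct(4))  # all four right: 1 chance in 70
print(p_at_least_k_correct(3))  # three or more right: 17 chances in 70
```

Getting all four right by luck happens about 1.4% of the time, which is why a perfect score counts as evidence of real ability, while three out of four does not.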
From the paper:
These guidelines aim to better assess phylogenetic signal and distinguish it from random trait distributions
The question is not “does a tree fit better than randomly distributed traits”, but whether a tree is the best graph structure to describe the data. I.e. no one seems to have done what Ewert did and question whether there is a better graph structure.
It’s rather easy to assess because Ewert’s model requires massive violations of a nested hierarchy which aren’t seen. On top of that, no one would use incomplete and inconsistent annotation databases when there is much more accurate sequence data available. It would be interesting to see Ewert’s results when using sequence data.
This is my point: no one looks for these violations. I believe, based on what I have read so far across multiple sources, that everyone else just tries to fit a tree model and dismisses the violations Ewert saw as noise, which they ignore or discard. That seems very plausible given how difficult it is to construct a phylogenetic tree and how much the data is massaged to fit the tree. Which is why I ask if anyone else has done an analysis along the same lines as Ewert, asking if there is an even better model than the tree. I think Ewert is the first to do this.
In which case, it is invalid to say a tree is evidence of evolution, since no other hypotheses have been tested besides randomness.
That’s like saying no one looks for a 300 foot tall raging dinosaur in downtown Manhattan, so there might be one. These violations should be impossible not to see.
Perhaps it would help if you gave us specific examples from Ewert to work from. I could pick some out if you like. For example:
How well do you think this statement would hold up under further investigation? If we took specific genes from these gene families and searched for homologs in other species, would we find them?
That is not obvious. If researchers only try to fit a tree and discard data that does not fit, then they are sweeping all the violations under the rug. It is like the newscaster focusing on a lady walking her dog while Godzilla rampages downtown behind the TV crew, and the newscaster claiming everything is fine.
A species with feathers, three middle ear bones, and mammary glands would be pretty hard to miss.
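Checking for this kind of violation is mechanical. In a strict nested hierarchy, the sets of species carrying any two traits are either disjoint or one sits inside the other. Here is a minimal sketch of that check. The species lists are hypothetical toy data, the function name is my own, and the check assumes each trait arose once and was never lost, which is a simplification.

```python
from itertools import combinations

def nesting_violations(traits):
    """Return pairs of traits whose carrier sets overlap without one containing the other."""
    violations = []
    for (name_a, set_a), (name_b, set_b) in combinations(traits.items(), 2):
        overlap = set_a & set_b
        if overlap and not (set_a <= set_b or set_b <= set_a):
            violations.append((name_a, name_b))
    return violations

# Hypothetical toy data, for illustration only.
traits = {
    "backbone": {"trout", "finch", "mouse", "bat"},
    "feathers": {"finch"},
    "mammary glands": {"mouse", "bat"},
    "three middle ear bones": {"mouse", "bat"},
}
print(nesting_violations(traits))  # no violations in the toy data

traits["feathers"] = {"finch", "bat"}  # an imaginary feathered bat
print(nesting_violations(traits))  # feathers now clash with both mammal traits
```

A single feathered bat breaks the nesting for every trait pair it touches, which is the sense in which such a species would be hard to miss.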
I mentioned the alleged case of genes only being shared between zebra fish and zebra finches and not other species. That should be easy to check, shouldn’t it?
Godzilla! CLOVERFIELD!!! That’s TWICE. And out of the mouth of two or three witnesses…
I don’t have the know-how to perform the check.
Unfortunately, Ewert never really specified what these gene families shared by just one species of fish and bird are, so your guess is as good as mine. If you have a line of communication to Ewert it would be great if we could get a list of genes he thinks are only found in these two species. Once we have the name of the genes (preferably Ensembl, NCBI, or Uniprot accession numbers so there’s no confusion) it is straightforward to get on various databases and see if there are homologs in other species.
Not strictly true.
All the graphs are referenced by DOI, and you can download the JSON files to find out.
That may be a project for later.
As a prelude, think about his approach in a general sense. He is comparing genes based on how the genes are annotated and organized into gene families in each database. Even by Ewert’s own admission, the same gene can be in different gene families in different databases. Also, annotations are rarely complete which makes it almost guaranteed that an unannotated gene in one species will have an annotated homolog in another species. Annotations are simply not a reliable data set for what Ewert is trying to do. What is reliable is sequence data, and I think it is very telling that Ewert doesn’t use this data set. Ewert doesn’t even spot check his data by looking at a handful of outlier genes that he is basing his conclusions on.
Doesn’t it seem rather obvious to do something like a BLAST search to see if these genes really are present or absent in various genomes?