Thanks. It really is a fascinating study.
So your summary is incorrect on a couple of points.
First, the algorithms we are discussing operate on genes, not species. They trace the phylogeny inferred from the sequences of these genes. Duplication events therefore give rise to different leaves in the tree, and most species map to multiple leaves. The phylogeny algorithm can make inferences about exactly where in the tree the duplications occur.
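To make the "genes, not species" point concrete, here is a toy sketch. The tree, the species, and the gene names are all made up for illustration; they are not from the study. It only shows that once duplications are in the tree, a single species shows up at several leaves.

```python
from collections import Counter

# Hypothetical gene tree as nested tuples. Leaf names are "species|gene".
# The second clade contains a mouse-specific duplication (geneX2a / geneX2b).
gene_tree = (
    ("human|geneX1", "mouse|geneX1"),
    ("human|geneX2", ("mouse|geneX2a", "mouse|geneX2b")),
)

def leaves(node):
    """Yield every leaf name under a node of the nested-tuple tree."""
    if isinstance(node, str):
        yield node
    else:
        for child in node:
            yield from leaves(child)

# Count how many leaves (genes) each species contributes to the tree.
leaves_per_species = Counter(name.split("|")[0] for name in leaves(gene_tree))
print(dict(leaves_per_species))
```

Here each species maps to more than one leaf, which is why the tree is a gene tree and not a species tree.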
Second, the analysis does not pick sequences based on the species. Instead, it starts by finding all similar sequences that have been observed in nature, including those that are only remotely similar. It is a 100% sequence-based approach. As long as there is enough sequence similarity, the gene is included in the analysis.
Third, your language about “projecting” relationships is a reasonable analogy. It matches closely what is happening in the similarity case, but it isn’t exactly what is happening in the phylogeny case. Still, your schematic is about right.
In the phylogeny case, we try to reconstruct the history of the genes. We first use the genetic similarity to give us the ancestry relationships between genes. Next, we infer the points in the tree (with careful attention to uncertainty) where specific functions are gained and lost. So we now have an inferred history of all the genes in the phylogeny and of the gain and loss of function in these genes. This model, in turn, makes predictions about the function of genes that are not annotated.
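The steps above can be sketched in a few lines. To be clear, this is a simplification I am swapping in for illustration: it uses Fitch parsimony, which picks ancestral states that minimize the number of function changes, and it ignores the uncertainty that the real analysis tracks carefully. The tree, gene names, and function labels are hypothetical.

```python
def fitch_predict(tree, known):
    """Predict a function for every leaf of a binary gene tree.

    tree  : nested tuples, leaves are gene-name strings (hypothetical format)
    known : dict mapping annotated gene names to a function label
    """
    states = frozenset(known.values())

    def up(node):
        # Bottom-up pass: compute the candidate state set at each node.
        if isinstance(node, str):
            # Unannotated leaves are compatible with every observed function.
            s = frozenset([known[node]]) if node in known else states
            return (node, s, [])
        kids = [up(child) for child in node]
        common = frozenset.intersection(*(k[1] for k in kids))
        s = common if common else frozenset.union(*(k[1] for k in kids))
        return (None, s, kids)

    def down(annotated, parent_state, out):
        # Top-down pass: keep the parent's state when allowed, else switch.
        name, s, kids = annotated
        state = parent_state if parent_state in s else min(s)
        if name is not None:
            out[name] = state
        for k in kids:
            down(k, state, out)
        return out

    return down(up(tree), None, {})


# Toy example: two clades, with geneA2 and geneB2 unannotated.
tree = (("geneA1", "geneA2"), ("geneB1", ("geneB2", "geneB3")))
known = {"geneA1": "kinase", "geneB1": "transporter", "geneB3": "transporter"}
predicted = fitch_predict(tree, known)
print(predicted["geneA2"], predicted["geneB2"])
```

In this toy case the unannotated genes inherit the function reconstructed for their own clade, so the prediction comes from the inferred history and not from a direct similarity ranking.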
The key point is that the reconstruction of history does change the predictions we make about the function of unknown sequences, and the predictions improve dramatically. If the reconstructed history were just a false reality, this would not make sense. Of course, there is still real uncertainty in the history, and even errors, but there seem to be enough correct inferences in it to improve predictions of function dramatically.
As you put it…
That is correct.
Exactly.
Well I would be more circumspect.
I would say: this is strong evidence for the “similarity caused by shared history” hypothesis over the “similarity caused by common function” hypothesis, whether or not evolution is ultimately true.
Moreover, this directly tests the adequacy of a design principle in explaining biology. It is absolutely true that proteins, machines, etc. that have the same function often show similarity to one another. It is also true that very similar things have different functions, and very different things have similar functions. In the end, we need to see how well this principle explains the data compared with the alternatives. We find that phylogeny really does systematically improve our predictions of function over just using the “common-function causes similarity” design principle.
I think this is an important body of work for those who dismiss evolutionary theory as useless, or who assert that similarity data captures everything we get from phylogenies. Something real and useful is being inferred by phylogenetics.
Thanks for the questions, and I hope that clarifies things!