Ewert has actually made an appearance at Peaceful Science. If you’d like to discuss the paper with the author directly, this is probably the best opportunity to do so.
And Ewert has been very gracious, so for those who do come over to Peaceful Science, please return the favor.
It will be interesting to see the guidance and predictions that the DG model can provide. One obvious one is that the DG model produces genetic modules. This is a completely new modeling element. The DG model shows how these modules support (feed into) the species. So this is going to raise a lot of interesting questions about how the genetic modules are related. What groupings or other patterns of the genetic modules do we see, and what do these groupings tell us about their design, function, and roles in molecular and cellular biology?
Here’s another one. I’d be interested in seeing the DG model for microRNA genes. Given how these relatively recently discovered genes have contradicted CD, I can’t help but wonder whether they turn out to have an interesting DG pattern. Just a thought, but it seems interesting to me.
The bottom line is, these genetic modules that DG produces are a new construct. They fall out of constructing the DG diagram. So what meaning, if any, do they have? I suspect it will be pretty interesting, and whole new ways of thinking will be opened up. But who knows?
Signal vs. Noise, Part 2: Hunter Opens the Klassen Study Again
I agree with you about the gene modules, and I was thinking they might be very interesting if applied to some examples of conversion in the evolutionary framework.
Thanks for the pointer. I’ve taken my comments there, which does indeed look to have a more pleasant tone.
Probably a topic for a different thread, but I’d be interested in hearing why the existence of miRNA contradicts common ancestry.
No, not the existence; the miRNA phylogenies.
I think your point about checking for data gaps is, obviously, a good one.
I was glad to see that! Hope you spend more time over there. There are always some interesting discussions going on.
Ah, gotcha. I’d be interested in reading about that, too.
Here are two articles by @Cornelius_Hunter for you:
or one that’s a little bit older:
Here is a Nature News article from 2014, midway in between:
From what I am seeing, there was once great hope that microRNAs would help illuminate phylogenies, on the idea that their sequences were generally more highly conserved (which turns out not to be true). Using them alone is thus a foolhardy way to build such trees; they would be no better than simply tracing any particular gene.
Okay. It seems the mechanism is that the intelligent designer has some modules he’s constrained himself to work with, and then grabs relevant pieces from each module to intelligently design each original pair (if reproducing sexually) for each species? Would there be any such thing in this frame as even ‘mini-common descent’? That is, could at least horses and zebras have shared a common ancestor, or not even that?
I have to say that I am shocked by the implication that any theory of science (least of all evolution) would be spoken of as similar to belief in the resurrection of Christ. Perhaps you misspoke.
I will add that when any area of science is considered beyond doubt and examination, that ceases to be science.
Well that’s an interesting question, because the DG model is kind of a superset, or generalization, of CD.
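One way to see that superset relationship (a toy sketch of my own, not Ewert’s actual formalism): a tree is just a dependency graph in which every node draws from at most one parent, so anything the tree model can express is also expressible as a dependency graph, while the reverse is not true.

```python
# Toy representation: map each node to the list of modules it depends on.
# The names here ("equid", "stripe_module", etc.) are purely illustrative.
tree = {"horse": ["equid"], "zebra": ["equid"],
        "equid": ["mammal"], "mammal": []}
dag = {"horse": ["equid"], "zebra": ["equid", "stripe_module"],
       "equid": ["mammal"], "mammal": [], "stripe_module": []}

def is_tree(graph):
    """A dependency graph collapses to a tree iff no node has 2+ parents."""
    return all(len(parents) <= 1 for parents in graph.values())

print(is_tree(tree))  # the CD-style structure qualifies as a tree
print(is_tree(dag))   # the DG structure does not
```

In this sense every common-descent tree is one point in the dependency-graph model space, which is what makes a penalized goodness-of-fit comparison between the two meaningful.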
Before I give a critique of Ewert’s paper, I want to acknowledge that it represents an important step in the debate over the models of origins. Ewert was not content to point out perceived flaws in conventional theory; he did his best to advance an alternative theory and to evaluate the two theories with respect to a certain class of genomic data. This approach warms my heart.
I do want to mention that I read through the paper only once, so I would welcome any corrections of points I might have misconstrued or just out-and-out overlooked.
In my opinion Ewert’s approach unfortunately suffers from flaws that prevent it from being used to draw any useful conclusions or comparisons. I do not believe these flaws are beyond remedy in future research, however. I hope that Ewert and others can address the methodological critiques of biologists like @glipsnort, @swamidass, @Sy_Garte, @sfmatheson, and @T_aquaticus in the near future.
Since I have graduate education and decades of working experience in computational methods and model evaluation, I will stick out my neck and offer a critique that I hope will be of some value, as well. I offer three main observations.
- The comparison of tree vs. dependency-graph predictions on simulated data is inapplicable to the biological data. Consequently, the research method is not validated.
- The tree model is oversimplified in comparison to evolutionary mechanisms, resulting in underfitting and inferior predictions with respect to cross-genomic data.
- Ewert’s simulation results cannot be replicated.
Conclusion 1: The simulation was invalid
1. The simulation did not model the biological data subsequently analyzed.
- Too few iterations by several orders of magnitude. Per the standard chronology that I do not believe Ewert (or Hunter) disputes, metazoans first appeared on earth at least 665 MYA. If we assume that the typical metazoan generation since then is very approximately 0.66 - 6.6 years, the simulation should have traversed 10^8 - 10^9 generations. However, the 5 simulations only traversed 10^4 generations.
- Too few habitats and niches by orders of magnitude - The first 3 simulations had only 1 habitat and a few niches. The fourth and fifth were better (5 habitats/15 niches and 3 habitats/9 niches, respectively), but still orders of magnitude fewer than the enormous richness of habitats and niches on our planet.
2. The simulations omit key forces thought to be at work in evolution.
- No gene duplication - The EvolSimulator has gene duplication capabilities, but they evidently went unused in Ewert’s 5 simulations.
- Changes in habitats and niches - The habitats and niches are static in the EvolSimulator, but in real life they are subject to change and, more rarely, dramatic shocks (e.g., the K-T boundary event).
3. The simulation misrepresents the quality of real-world data. Accurately representing the data quality would have resulted in different validation results.
- Real-world genomes are incomplete, but the simulation included all of the simulated genes for the simulated populations.
- Most real-world genomes are not present in the genomic catalogs. There are 7.7 M extant metazoan species, but orders of magnitude fewer genomes in the catalogs (EggNog, Ensembl, Uniref-50, etc.)
- To model the extant genomic data more accurately, the simulations should have randomly excluded 99.999% of simulated species and X percentage of each simulated species’ genes. @glipsnort asserts that performing these exclusions would have resulted in a strong favoring of the dependency graph model over the tree model even in the simulation, and I agree. Here’s why: omitting the overwhelming majority of tree structure leaves only a few disconnected branches, and a few disconnected branches are more readily modeled by a dependency graph than by an evolutionary tree. (I do not know how much of the genomes are generally missing from existing genome catalogs, so I cannot specify the value of X.)
4. Overall, the simulation data are vastly oversimplified compared to real-world data. Thus it is not surprising that the relatively simple tree model would win the goodness-of-fit comparison with the (penalized) more complex dependency graph model. Because the real-world data are vastly more complex than the simulation data, however, the attempted validation fails.
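To put a rough number on the exclusion effect described in point 3, here is a toy subsampling calculation. The species count comes from the figure above; the retention probability and the "sister pair" framing are my own illustrative assumptions, not anything from Ewert's paper:

```python
import random

random.seed(0)
n_species = 7_700_000   # approximate count of extant metazoan species
keep_prob = 1e-5        # toy assumption: catalogs retain ~0.001% of species

# Draw which species make it into the "catalog"
kept = sum(1 for _ in range(n_species) if random.random() < keep_prob)

# If tips are paired into sister "cherries", a pair survives intact only
# when BOTH tips are sampled: probability keep_prob ** 2 per pair.
expected_intact_pairs = (n_species // 2) * keep_prob ** 2

print(kept)                  # on the order of tens of species retained
print(expected_intact_pairs) # far below 1: essentially no sister pairs survive
```

Under these toy numbers, the sampled data consist almost entirely of isolated branches with no close relatives present, which is exactly the situation a dependency graph can fit more flexibly than a tree.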
Conclusion 2: The oversimplified tree model results in underfitting bias
While the paper laudably applies a correction for the overfitting tendency of the inherently complex dependency graph model, it oversimplifies the tree model without correcting the resulting tendency to underfit. Ewert notes the significantly lower complexity of his tree model vis-à-vis the theory of evolution:
An obvious objection is that we have not included any of the mechanisms thought to account for nonhierarchical data such as incomplete lineage sorting, gene flow, convergent evolution, or horizontal gene transfer. As such, it might be argued that any of the features of the data interpreted as evidence for the dependency graph may also be explained by these mechanisms.
Let’s take a look at why this lack of complexity could very well have confounded the goodness-of-fit comparison. Here’s how complexity (on the X axis) and model generalization (on the Y axis) can be depicted:
I draw your attention to the green U-shaped curve. When a model is insufficiently complex, it is in the high bias region on the left, and predicts poorly both on labeled data and unseen data. When it is too complex, it is in the high variance region on the right, and predicts poorly on unseen data even though it predicts labeled data quite accurately.
It is highly plausible–perhaps a biologist would say highly likely–that Ewert’s exclusion of known, complex factors described by the theory of evolution pushes the tree model very far into the high bias region on the left when it is dealing with complex, real-world data.
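For readers who want to see that U-shaped curve in action, here is a minimal polynomial-regression sketch. The data are entirely hypothetical and have nothing to do with Ewert's genomes; they just show an underfit line, a reasonable cubic, and an overfit degree-15 polynomial:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)        # training data
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

for degree in (1, 3, 15):
    coeffs = np.polyfit(x, y, degree)
    train_err = mse(np.polyval(coeffs, x), y)
    test_err = mse(np.polyval(coeffs, x_test), y_test)
    # degree 1: high error everywhere (underfit / high bias)
    # degree 3: low error on both sets (good generalization)
    # degree 15: training error keeps falling, test error does not (overfit)
    print(degree, round(train_err, 3), round(test_err, 3))
```

The training error falls monotonically as complexity grows, but the unseen-data error traces the U: the degree-1 model sits in the high-bias region, which is precisely where I am arguing the stripped-down tree model lands on real genomic data.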
A quote attributed to Einstein says it with pith:
Everything should be made as simple as possible, but not simpler. [emphasis mine]
Ewert already acknowledges that the dependency graph model is significantly more complex than the tree model.
The question thus is: does the dependency graph fit the biological data better enough to warrant its additional complexity?
I therefore fail to understand why @Cornelius_Hunter pleads the case for DG by claiming that the conventional evolution model is too complex. True, the design model can be melded with the DG model with a very simple summary–something like, “Constructing a genome from dependent modules is a design process.” That simple summary is a facade over tremendous complexity, just as a steering wheel and an accelerator hide a whole lot of complexity in an automobile.
Conclusion 3: Ewert’s simulations cannot be replicated without additional information
The EvolSimulator v.2.1.0 has 68 parameters (in addition to logging and output format parameters), but Ewert’s paper only specifies settings for 10 of them. Settings for the -s randomSeed gene and -z randomSeed genome parameters would, I think, be especially helpful, because they would allow other researchers to replicate Ewert’s results exactly.
Fortunately, this problem is easily remedied: Ewert only needs to publish the Python scripts that he used to generate the 5 simulations.
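To illustrate why published seeds matter, here is a generic sketch of a stochastic run; this is not EvolSimulator's actual interface, just the general principle that a fixed seed pins down every random choice:

```python
import random

def toy_simulation(seed, n_genes=5):
    # Stand-in for a stochastic simulator run (NOT EvolSimulator's real API):
    # seeding the generator fixes every random draw, so the run is replicable.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n_genes)]

print(toy_simulation(42) == toy_simulation(42))  # same seed: identical run
print(toy_simulation(42) == toy_simulation(43))  # different seed: (almost surely) different run
```

Without the seeds, other researchers can at best reproduce Ewert's results statistically, never exactly, which makes it hard to distinguish a replication failure from ordinary run-to-run variation.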
For these 3 reasons, it is in my opinion premature to make any claims for one model or the other based on Ewert’s paper. However, I hope that further work can address the issues that I and others have raised, and allow a fruitful comparison to be made.
I look forward to seeing what the DG effort might produce. So far I do not think it has produced anything that would disturb the current theory. If and when it is able to account for the many classes of data better than the existing theory, then I am sure the Nobel Committee will be hearing about it.
BTW, I wanted to get back to the general point, made several times by several people, that the CD model in the paper is inadequate because it fails to include the various additional mechanisms (homoplasy, ILS, duplication, deletion, etc.). Some have stated that the test presented in the paper is therefore invalid. I wanted to address this, because there are two important points that need to be understood.
First, this goes both ways. That is, additional mechanisms can be applied to DG as well.
Second, the additional mechanisms will not be for free. They will penalize CD, and this penalty is very important in model selection. It has to be, and this has been borne out in real data analytics. I have seen this myself. Modeling terms that I thought were important and legitimate were thrown out in the model selection process. This is important. If you do not do this, you will end up with a model that fails in its predictions. It looks great at the training stage, but fails when used with new data. This is a real problem, well understood by data analysts.
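As a concrete illustration of that penalty, here is a toy model-selection comparison using the Bayesian Information Criterion. This is my own example; the specific scoring rule in Ewert's paper differs, but the pay-for-your-parameters logic is the same:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)   # the true signal needs only one slope term

def bic(y, y_hat, k):
    # BIC = n*ln(RSS/n) + k*ln(n); lower is better.
    # Each extra parameter must buy enough fit to offset its ln(n) penalty.
    rss = float(np.sum((y - y_hat) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

designs = {
    "simple (y ~ x)": np.column_stack([np.ones(n), x]),
    "complex (y ~ x + x^2 + x^3)": np.column_stack([np.ones(n), x, x**2, x**3]),
}
for name, X in designs.items():
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(name, round(bic(y, X @ beta, X.shape[1]), 1))
```

The complex model always fits the training data at least as well, yet it scores worse once the penalty is applied, because its extra terms do not earn their keep. That is the trade-off any add-on mechanism, whether bolted onto CD or onto DG, has to survive.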
You may say, “well, tough, that’s the way biology is.” Well fine, but if so, you have a real uphill battle. For you are up against a model which has an enormous head start. Once you begin to add those add-on mechanisms, you will incur cost. And also, add-on mechanisms would be available to DG as well.
I’m not saying this is an impossible task. Perhaps CD can somehow be shown to be better than DG, but that appears quite unlikely.
Well I didn’t do that. My point was that if add-ons are allowed on one side, then they need to be allowed on the other side as well. But there is a cost …