Before I give a critique of Ewert’s paper, I want to acknowledge that it represents an important step in the debate over the models of origins. Ewert was not content to point out perceived flaws in conventional theory; he did his best to advance an alternative theory and to evaluate the two theories with respect to a certain class of genomic data. This approach warms my heart.
I do want to mention that I read through the paper only once, so I would welcome any corrections of points I might have misconstrued or just out-and-out overlooked.
In my opinion Ewert’s approach unfortunately suffers from flaws that prevent it from being used to draw any useful conclusions or comparisons. I do not believe these flaws are beyond remedy in future research, however. I hope that Ewert and others can address the methodological critiques of biologists like @glipsnort, @swamidass, @Sy_Garte, @sfmatheson, and @T_aquaticus in the near future.
Since I have graduate education and decades of working experience in computational methods and model evaluation, I will stick out my neck and offer a critique that I hope will be of some value, as well. I offer three main observations.
- The comparison of tree vs. graph dependency predictions on simulated data is inapplicable to the biological data. Consequently, the research method is not validated.
- The tree model is oversimplified in comparison to evolutionary mechanisms, resulting in underfitting and inferior predictions with respect to cross-genomic data.
- Ewert’s simulation results cannot be replicated.
Conclusion 1: The simulation was invalid
1. The simulation did not model the biological data subsequently analyzed.
- Too few iterations by several orders of magnitude. Per the standard chronology that I do not believe Ewert (or Hunter) disputes, metazoans first appeared on earth at least 665 MYA. If we assume that the typical metazoan generation since then is very approximately 0.66 - 6.6 years, the simulation should have traversed roughly 10^8 - 10^9 generations (665 million years divided by 6.6 and by 0.66 years, respectively). However, the 5 simulations only traversed 10^4 generations.
- Too few habitats and niches by orders of magnitude - The first 3 simulations had only 1 habitat and a few niches. The fourth and fifth were better (5 habitats/15 niches and 3 habitats/9 niches, respectively), but still orders of magnitude fewer than the enormous richness of habitats and niches on our planet.
2. The simulations omit key forces thought to be at work in evolution.
- No gene duplication - The EvolSimulator has gene duplication capabilities, but they evidently went unused in Ewert’s 5 simulations.
- No changes in habitats and niches - The habitats and niches are static in the EvolSimulator, but in real life they are subject to change and, more rarely, dramatic shocks (e.g., the K-T boundary event).
3. The simulation misrepresents the quality of real-world data. Accurately representing the data quality would have resulted in different validation results.
- Real-world genomes are incomplete. But the simulation included all of the simulated genes for simulated populations.
- Most real-world genomes are not present in the genomic catalogs. There are an estimated 7.7 M extant metazoan species, but the catalogs (EggNog, Ensembl, Uniref-50, etc.) hold orders of magnitude fewer genomes.
- To model the extant genomic data more accurately, the simulations should have randomly excluded 99.999% of simulated species and some percentage X of each remaining species’ genes. @glipsnort asserts that performing these exclusions would have resulted in a strong favoring of the dependency graph model over the tree model even in the simulation, and I agree. Here’s why: omitting the overwhelming majority of tree structure leaves only a few disconnected branches, and a few disconnected branches are more readily modeled by a dependency graph than by an evolutionary tree. (I do not know what fraction of a genome is typically missing from the existing genome catalogs, so I cannot specify the value of X.)
4. Overall, the simulation data are vastly oversimplified compared to real-world data. Thus it is not surprising that the relatively simple tree model would win the goodness-of-fit comparison with the (penalized) more complex graph dependency model. Because the real-world data are vastly more complex than the simulation data, however, the attempted validation fails.
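The generation-count gap in point 1 can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in this post (665 million years, generation times of 0.66 to 6.6 years, 10^4 simulated generations). The variable and function names are my own, and nothing here is taken from Ewert’s code or from the EvolSimulator.

```python
import math

# Figures taken from the critique above; nothing here comes from Ewert's code.
ELAPSED_YEARS = 665e6            # time since the first metazoans
GENERATION_YEARS = (0.66, 6.6)   # assumed range of typical generation times
SIMULATED_GENERATIONS = 1e4      # generations traversed by each simulation


def generations(elapsed_years: float, generation_years: float) -> float:
    """Number of generations that fit into the elapsed time."""
    return elapsed_years / generation_years


def orders_of_magnitude(ratio: float) -> float:
    """Base-10 logarithm of a ratio, i.e. the gap in orders of magnitude."""
    return math.log10(ratio)


# Short generation times give the most generations, long ones the fewest.
most = generations(ELAPSED_YEARS, min(GENERATION_YEARS))
fewest = generations(ELAPSED_YEARS, max(GENERATION_YEARS))

gap_low = orders_of_magnitude(fewest / SIMULATED_GENERATIONS)
gap_high = orders_of_magnitude(most / SIMULATED_GENERATIONS)

print(f"generations needed: {fewest:.2e} to {most:.2e}")
print(f"shortfall: {gap_low:.1f} to {gap_high:.1f} orders of magnitude")
```

Under these assumptions the required count lands between about 10^8 and 10^9 generations, so the simulations fall short by roughly four to five orders of magnitude. A different generation-time range would shift the numbers but not the qualitative conclusion.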
Conclusion 2: The oversimplified tree model results in underfitting bias
While the paper laudably applies a correction for the overfitting tendency of the inherently complex dependency graph model, it oversimplifies the tree model without correcting the resulting tendency to underfit. Ewert notes the significantly lower complexity of his tree model vis-a-vis the theory of evolution:
An obvious objection is that we have not included any of the mechanisms thought to account for nonhierarchical data such as incomplete lineage sorting, gene flow, convergent evolution, or horizontal gene transfer. As such, it might be argued that any of the features of the data interpreted as evidence for the dependency graph may also be explained by these mechanisms.
Let’s take a look at why this lack of complexity could very well have confounded the goodness-of-fit comparison. Here’s how complexity (on the X axis) and model generalization (on the Y axis) can be depicted:
I draw your attention to the green U-shaped curve. When a model is insufficiently complex, it is in the high bias region on the left, and predicts poorly both on labeled data and unseen data. When it is too complex, it is in the high variance region on the right, and predicts poorly on unseen data even though it predicts labeled data quite accurately.
It is highly plausible (perhaps a biologist would say highly likely) that Ewert’s exclusion of known, complex factors described by the theory of evolution pushes the tree model very far into the high bias region on the left when it is dealing with complex, real-world data.
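For readers who want to see the U-shaped curve emerge from numbers, here is a minimal sketch of the bias-variance trade-off. Everything in it is hypothetical and chosen only for illustration: the sine-wave "truth", the noise level, and a piecewise-constant model whose complexity is its number of bins. It says nothing about Ewert’s tree or dependency graph models specifically.

```python
import math
import random


def true_signal(x: float) -> float:
    """Hypothetical ground truth: one period of a sine wave on [0, 1)."""
    return math.sin(2 * math.pi * x)


def sample(n: int, rng: random.Random, noise: float = 0.3):
    """Draw n noisy (x, y) observations of the signal."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_signal(x) + rng.gauss(0.0, noise)) for x in xs]


def fit_bins(train, n_bins: int):
    """Piecewise-constant model: predict the mean y observed in each bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in train:
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(y for _, y in train) / len(train)
    # Bins with no training points fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def mse(model, data) -> float:
    """Mean squared prediction error of a binned model on a data set."""
    n_bins = len(model)
    errors = [(y - model[min(int(x * n_bins), n_bins - 1)]) ** 2 for x, y in data]
    return sum(errors) / len(errors)


rng = random.Random(0)        # fixed seed so the run is repeatable
train = sample(200, rng)      # data the model is fitted to
unseen = sample(2000, rng)    # data the model never sees during fitting

results = {}
for n_bins in (1, 8, 200):    # too simple, moderate, too complex
    model = fit_bins(train, n_bins)
    results[n_bins] = (mse(model, train), mse(model, unseen))
    print(f"{n_bins:>3} bins  train MSE {results[n_bins][0]:.3f}  "
          f"unseen MSE {results[n_bins][1]:.3f}")
```

The single-bin model sits in the high bias region and should do poorly on both data sets. The 200-bin model sits in the high variance region: it fits the training data best but should do worse on unseen data than the 8-bin model. The moderate model is the bottom of the U.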
A quote attributed to Einstein says it with pith:
Everything should be made as simple as possible, but not simpler. [emphasis mine]
Ewert already acknowledges that the dependency graph model is significantly more complex than the tree model.
The question thus is: does the dependency graph fit the biological data better enough to warrant its additional complexity?
I therefore fail to understand why @Cornelius_Hunter pleads the case for DG by claiming that the conventional evolution model is too complex. True, the design model can be melded with the DG model with a very simple summary, something like, “Constructing a genome from dependent modules is a design process.” That simple summary is a facade over tremendous complexity, just as a steering wheel and an accelerator hide a whole lot of complexity in an automobile.
Conclusion 3: Ewert’s simulations cannot be replicated without additional information
The EvolSimulator v.2.1.0 has 68 parameters (in addition to logging and output format parameters), but Ewert’s paper only specifies settings for 10 of them. Settings for the -s randomSeed gene and -z randomSeed genome would, I think, be especially helpful, because they would allow other researchers to exactly replicate Ewert’s results.
Fortunately, this problem is easily remedied: Ewert only needs to publish the Python scripts that he used to generate the 5 simulations.
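To illustrate why the seed settings matter so much for replication, here is a small sketch. The `toy_simulation` function is a hypothetical stand-in that I made up for this post. It is not the EvolSimulator and does not use its parameters. It only shows the general principle that a stochastic simulation is exactly repeatable when its random seed is published.

```python
import random


def toy_simulation(seed: int, generations: int = 1000) -> list:
    """Hypothetical stand-in for a stochastic simulator.

    Tracks a population size that randomly grows or shrinks by one
    individual in every generation.
    """
    rng = random.Random(seed)  # private generator, fully determined by the seed
    size = 100
    history = []
    for _ in range(generations):
        size = max(1, size + rng.choice((-1, 1)))
        history.append(size)
    return history


run_a = toy_simulation(seed=42)
run_b = toy_simulation(seed=42)  # same seed as run_a
run_c = toy_simulation(seed=43)  # different seed

print("same seed gives identical runs:", run_a == run_b)
print("different seed gives identical runs:", run_a == run_c)
```

With the same seed, the two runs match generation for generation. Without it, another researcher can only hope to reproduce the results statistically, not exactly.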
Conclusion
For these 3 reasons, it is in my opinion premature to make any claims for one model or the other based on Ewert’s paper. However, I hope that further work can address the issues that I and others have raised, and allow a fruitful comparison to be made.
Miscellany
I look forward to seeing what the DG effort might produce. So far I do not think it has produced anything that would disturb the current theory. If and when it is able to account for the many classes of data better than the existing theory, then I am sure the Nobel Committee will be hearing about it.
Best regards,
Chris