New Paper Demonstrates Superiority of Design Model

Okay. It seems the mechanism is the intelligent designer has some modules he’s constrained himself to work with and then grabs relevant pieces from each module to intelligently design each original pair (if reproducing sexually) for each species? Would there be any such thing in this frame as even ‘mini-common descent?’ That is where at least horses and zebras could have shared a common ancestor or not even that?

I have to say that I am shocked by the implication that any theory of science (and least of all evolution) would be spoken as similar to belief in the resurrection of Christ. Perhaps you misspoke.

I will add that when any area of science is considered beyond doubt and examination, that ceases to be science.

Well that’s an interesting question, because the DG model is kind of a superset, or generalization, of CD.

Before I give a critique of Ewert’s paper, I want to acknowledge that it represents an important step in the debate over the models of origins. Ewert was not content to point out perceived flaws in conventional theory; he did his best to advance an alternative theory and to evaluate the two theories with respect to a certain class of genomic data. This approach warms my heart.

I do want to mention that I read through the paper only once, so I would welcome any corrections of points I might have misconstrued or just out-and-out overlooked.

In my opinion Ewert’s approach unfortunately suffers from flaws that prevent it from being used to draw any useful conclusions or comparisons. I do not believe these flaws are beyond remedy in future research, however. I hope that Ewert and others can address the methodological critiques of biologists like @glipsnort, @swamidass, @Sy_Garte, @sfmatheson, and @T_aquaticus in the near future.

Since I have graduate education and decades of working experience in computational methods and model evaluation, I will stick out my neck and offer a critique that I hope will be of some value, as well. I offer three main observations.

  1. The comparison of tree vs. graph dependency predictions on simulated data is inapplicable to the biological data. Consequently, the research method is not validated.
  2. The tree model is oversimplified in comparison to evolutionary mechanisms, resulting in underfitting and inferior predictions with respect to cross-genomic data.
  3. Ewert’s simulation results cannot be replicated.

Conclusion 1: The simulation was invalid
1. The simulation did not model the biological data subsequently analyzed.

  • Too few iterations by several orders of magnitude. Per the standard chronology that I do not believe Ewert (or Hunter) disputes, metazoans first appeared on earth at least 665 MYA. If we assume that the typical metazoan generation since then is very approximately 0.66 - 6.6 years, the simulation should have traversed 10^7 - 10^8 generations. However, the 5 simulations only traversed 10^4 generations.
  • Too few habitats and niches by orders of magnitude - The first 3 simulations had only 1 habitat and a few niches. The fourth and fifth were better (5 habitats/15 niches and 3 habitats/9 niches, respectively), but still orders of magnitude fewer than the enormous richness of habitats and niches on our planet.

2. The simulations omit key forces thought to be at work in evolution.

  • No gene duplication - The EvolSimulator has gene duplication capabilities, but they evidently went unused in Ewert’s 5 simulations.
  • No changes in habitats and niches - The habitats and niches are static in the EvolSimulator, but in real life they are subject to change and, more rarely, dramatic shocks (e.g., the K-T boundary event).

3. The simulation misrepresents the quality of real-world data. Accurately representing the data quality would have resulted in different validation results.

  • Real-world genomes are incomplete. But the simulation included all of the simulated genes for simulated populations.
  • Most real-world genomes are not present in the genomic catalogs. There are 7.7 M extant metazoan species, but orders of magnitude fewer genomes in the catalogs (EggNog, Ensembl, Uniref-50, etc.)
  • To model the extant genomic data more accurately, the simulations should have randomly excluded 99.999% of simulated species and X percentage of each simulated species’ genes. @glipsnort asserts that performing these exclusions would have resulted in a strong favoring of the dependency graph model over the tree model even in the simulation, and I agree. Here’s why: omitting the overwhelming majority of tree structure leaves only a few disconnected branches, and a few disconnected branches are more readily modeled by a dependency graph than by an evolutionary tree. (I do not know how much of the genomes are generally missing from existing genome catalogs, so I cannot specify the value of X.)
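The exclusion step proposed above can be sketched in a few lines. This is only an illustration under stated assumptions: the `thin_catalog` function, the dict-of-sets genome format, and the toy numbers are all hypothetical and are not taken from the paper or from EvolSimulator. The toy run keeps 1% of species rather than 0.001% only so that something survives in a small example, and `gene_drop_frac=0.2` is an arbitrary placeholder for the unknown X.

```python
import random

def thin_catalog(genomes, species_keep_prob, gene_drop_frac, seed=0):
    """Randomly drop whole species, then drop a fraction of genes from each survivor.

    genomes: dict mapping species name -> set of gene-family ids (hypothetical format).
    """
    rng = random.Random(seed)
    thinned = {}
    for species, genes in genomes.items():
        if rng.random() >= species_keep_prob:
            continue  # this species never made it into the catalog
        # Each gene is independently lost with probability gene_drop_frac.
        thinned[species] = {g for g in sorted(genes) if rng.random() >= gene_drop_frac}
    return thinned

# Toy input: 10,000 simulated species with 50 gene families each.
toy = {f"sp{i}": set(range(50)) for i in range(10_000)}
sparse = thin_catalog(toy, species_keep_prob=0.01, gene_drop_frac=0.2, seed=42)
print(len(sparse), "species kept out of", len(toy))
```

Running both models on the thinned output rather than on the complete simulated genomes would be one way to check whether the validation result survives realistic sampling.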

4. Overall, the simulation data are vastly oversimplified compared to real-world data. Thus it is not surprising that the relatively simple tree model would win the goodness-of-fit comparison with the (penalized) more complex graph dependency model. Because the real world data are vastly more complex than the simulation data, however, the attempted validation fails.

Conclusion 2: The oversimplified tree model results in underfitting bias
While the paper laudably applies a correction for the overfitting tendency of the inherently complex dependency graph model, it oversimplifies the tree model without correcting the resulting tendency to underfit. Ewert notes the significantly lower complexity of his tree model vis-a-vis the theory of evolution:

An obvious objection is that we have not included any of the mechanisms thought to account for nonhierarchical data such as incomplete lineage sorting, gene flow, convergent evolution, or horizontal gene transfer. As such, it might be argued that any of the features of the data interpreted as evidence for the dependency graph may also be explained by these mechanisms.

Let’s take a look at why this lack of complexity could very well have confounded the goodness-of-fit comparison. Here’s how complexity (on the X axis) and model generalization (on the Y axis) can be depicted:

I draw your attention to the green U-shaped curve. When a model is insufficiently complex, it is in the high bias region on the left, and predicts poorly both on labeled data and unseen data. When it is too complex, it is in the high variance region on the right, and predicts poorly on unseen data even though it predicts labeled data quite accurately.
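The U-shaped relationship can be reproduced numerically. The following is a minimal sketch and has nothing to do with Ewert's models: it fits a piecewise-constant model to noisy samples of a sine curve, with the number of bins standing in for model complexity. The data, the noise level, and the bin counts are all assumptions chosen for illustration.

```python
import math
import random

def make_data(n, rng, noise=0.3):
    """Noisy samples of one sine period on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, noise)) for x in xs]

def fit_bins(train, n_bins):
    """Piecewise-constant fit: predict the mean training y within each bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in train:
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(y for _, y in train) / len(train)
    # Bins with no training points fall back to the overall mean.
    return [sums[b] / counts[b] if counts[b] else overall for b in range(n_bins)]

def mse(model, data):
    """Mean squared error of a binned model on (x, y) pairs."""
    n_bins = len(model)
    return sum((y - model[min(int(x * n_bins), n_bins - 1)]) ** 2
               for x, y in data) / len(data)

rng = random.Random(0)
train, test = make_data(200, rng), make_data(2000, rng)
results = {}
for n_bins in (1, 8, 256):  # too simple, moderate, too complex
    model = fit_bins(train, n_bins)
    results[n_bins] = (mse(model, train), mse(model, test))
    print(f"bins={n_bins:4d}  train MSE={results[n_bins][0]:.3f}  "
          f"test MSE={results[n_bins][1]:.3f}")
```

Training error keeps falling as bins are added, while error on the held-out data is worst at the two extremes: the single-bin model sits in the high bias region and the 256-bin model in the high variance region.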

It is highly plausible–perhaps a biologist would say highly likely–that Ewert’s exclusion of known, complex factors described by the theory of evolution pushes the tree model very far into the high bias region on the left when it is dealing with complex, real-world data.

A quote attributed to Einstein says it with pith:

Everything should be made as simple as possible, but not simpler. [emphasis mine]

Ewert already acknowledges that the dependency graph model is significantly more complex than the tree model.

The question thus is: does the dependency graph fit the biological data better enough to warrant its additional complexity?

I therefore fail to understand why @Cornelius_Hunter pleads the case for DG by claiming that the conventional evolution model is too complex. True, the design model can be melded with the DG model with a very simple summary–something like, “Constructing a genome from dependent modules is a design process.” That simple summary is a facade over tremendous complexity, just as a steering wheel and an accelerator hide a whole lot of complexity in an automobile.

Conclusion 3: Ewert’s simulations cannot be replicated without additional information

The EvolSimulator v.2.1.0 has 68 parameters (in addition to logging and output format parameters), but Ewert’s paper only specifies settings for 10 of them. Settings for the -s randomSeed gene and -z randomSeed genome parameters would, I think, be especially helpful, because they would allow other researchers to exactly replicate Ewert’s results.

Fortunately, this problem is easily remedied: Ewert only needs to publish the Python scripts that he used to generate the 5 simulations.
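To illustrate why published seed values matter for replication, here is a minimal sketch using a toy stochastic process. It is not the EvolSimulator and does not use its options; `toy_simulation` and the seed value are hypothetical stand-ins.

```python
import random

def toy_simulation(seed, generations=50):
    """Stand-in for a stochastic simulator: a seeded random walk."""
    rng = random.Random(seed)
    state = 0
    history = []
    for _ in range(generations):
        state += rng.choice((-1, 1))
        history.append(state)
    return history

run_a = toy_simulation(seed=12345)
run_b = toy_simulation(seed=12345)
print(run_a == run_b)  # same seed, same trajectory
```

Without the seed, a second researcher can only reproduce the statistical behavior of a run, not the run itself.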

Conclusion

For these 3 reasons, it is in my opinion premature to make any claims for one model or the other based on Ewert’s paper. However, I hope that further work can address the issues that I and others have raised, and allow a fruitful comparison to be made.

Miscellany

I look forward to seeing what the DG effort might produce. So far I do not think it has produced anything that would disturb the current theory. If and when it is able to account for the many classes of data better than the existing theory, then I am sure the Nobel Committee will be hearing about it. :)

Best regards,
Chris

BTW, I wanted to get back to the general point, made several times, by several people, that the CD model in the paper is inadequate because it fails to include the various additional mechanisms (homoplasy, ILS, duplication, deletion, etc.). Some have stated that the test presented in the paper is therefore invalid. I wanted to address this, because there are two important points that need to be understood about this.

First, this goes both ways. That is, additional mechanisms can be applied to DG as well.

Second, the additional mechanisms will not be for free. They will penalize CD, and this penalty is very important in model selection. It has to be, and this has been borne out in real data analytics. I have seen this myself. Modeling terms that I thought were important and legitimate were thrown out in the model selection process. This is important. If you do not do this, you will end up with a model that fails in its predictions. It looks great at the training stage, but fails when used with new data. This is a real problem, well understood by data analysts.
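The cost of an added modeling term can be made concrete with a penalized score. The sketch below uses the Bayesian information criterion (BIC) as a familiar stand-in; the paper's own complexity correction may well differ, and the log-likelihoods and parameter counts are invented numbers for illustration only.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

n_obs = 1000
simple = bic(log_likelihood=-1500.0, n_params=5, n_obs=n_obs)
# An added mechanism buys a small gain in fit at the cost of 3 more parameters.
extended = bic(log_likelihood=-1495.0, n_params=8, n_obs=n_obs)
# The same 3 parameters are worth it only if the gain in fit is large enough.
extended_big_gain = bic(log_likelihood=-1480.0, n_params=8, n_obs=n_obs)

print(f"simple={simple:.1f}  extended={extended:.1f}  "
      f"extended_big_gain={extended_big_gain:.1f}")
```

With these made-up numbers the small improvement does not pay for the extra parameters, while the larger one does. The penalty therefore cuts in both directions: an added term is dropped or kept depending on how much fit it buys.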

You may say, “well, tough, that’s the way biology is.” Well fine, but if so, you have a real uphill battle. For you are up against a model which has an enormous head start. Once you begin to add those add-on mechanisms, you will incur cost. And also, add-on mechanisms would be available to DG as well.

I’m not saying this is an impossible task. Perhaps CD can somehow be shown to be better than DG, but that appears quite unlikely.

6 posts were merged into an existing topic: Why Aren’t the Twin Locations of >100k+ ERV’s (human vs. chimp) Discussed More?

Well I didn’t do that. My point was that if add-ons are allowed on one side, then they need to be allowed on the other side as well. But there is a cost …

Out of interest, what is your purpose in discussing this paper? By that I mean, is there a theory that is supported by this work, or is it to compare simulations with CD and seek to replace CD with something else, such as a dependency (something)? Do you have a theory of dependency and if so, what is it?

Hello GJDS,

Perhaps I misspoke, or perhaps you misunderstood? I don’t even understand where you’re coming from with this comment, and I say this not with an exasperated tone but an honestly bewildered, brotherly tone. Similar in what way? And how did I imply that? How is my comment different from a thousand other discussions of miracles and resurrections and naturalism on the Forum?

I was merely trying to understand Cornelius’s mental image of an “evolutionist.” He was painting what seemed to me to be a straw man of “everybody who believes in evolution must of necessity be of Richard Dawkins’ ilk.” I was reassured to find, upon further prodding, that I was wrong, and that he does allow that there are people who accept evolution who also believe in miracles such as the resurrection of Jesus. This is a (small) win for gracious Christian dialogue, even if my approach (using the term “denialist” in particular) was less than 100% gracious.

haha Well I should hope so!

I admit that the use of “denialist” was a bit of a barb, and if our dear moderators (particularly @BradKramer) had been watching more carefully, it might have been removed just like my other snarky comment at the outset of this thread.

It’s kind of you to ask, but I don’t feel I could add anything worth reading beyond the thoughtful (and not unkind) critiques that have been offered by several on this Forum and over at Peaceful Science.

I think overall what I see is that Dr. Ewert has offered the tentative beginnings of a research program, and he has admirably entered the ring to test it, and you have lifted up his arm and declared him the undisputed winner by KO before the fight has even really begun.

It seems that I misunderstood, and for that I am sorry.

The null hypothesis is randomly distributed characters. The null hypothesis would also be able to capture Ewert’s model because there would be massive discordance between modules of genes. I’m not talking about small incongruencies like the deep nodes shared between yeast species, but massive discordances like jellyfish and a rodent group being placed in the same group with other rodent species in the outgroup.

Occam’s razor is also being misused in the paper. The razor slices away extra assumptions, not complexity. It isn’t the simplest answer that is preferred, but the one with the fewest assumptions. A complex explanation with fewer assumptions is preferred over a simple explanation with more assumptions.

I think that sums it up. He is being suitably cautious, but he needs to be given space to explore and respond to the questions people are asking. I’m not convinced yet, but I like Dr Ewert’s attitude.

What I find interesting about this approach is that it is very macro-scale. In this way, it doesn’t have to deal with the nitty-gritty one encounters when comparing two closely-related genomes (say for example, the human and chimpanzee genomes). It’s almost as if the anti-evolution ID group has realized they just can’t win when doing close comparisons.

I too commend Ewert for putting his ideas out there. But the skeptic in me wants to know how he deals with the more pressing problems that closely-related genomes generate for his idea. The way forward for a research program is not to try to find a niche where your favourite ideas are protected from the data…

If I am reading Ewert’s thesis correctly, there shouldn’t be any closely related organisms. What we should see is a collection of species which are a mish mash of different modules thrown together. We should see one species with modules from fish, birds, and whales and another species with modules from lizards, bats, and jellyfish.

For example, we can look at Fig. 3 and this quote:

“All marine species depend on a marine module, and the echolocating species depend on the echolocation module. The dependency graph is essentially a tree with extra flexibility; the modules can explain genes shared between species thought to be only distantly related by common descent. A module is not restricted to reusing code from a single source, but can freely reuse from multiple sources. Compare this to common descent where each species must almost exclusively draw from a single source: its ancestral species.”

I am going to go out on a limb and say that Ewert is probably talking about the Prestin gene. With that caveat, I will assume that this is what he is talking about and show how this creates a problem for his hypothesis.

First off, humans have a Prestin gene. We have the supposed echolocating module. I believe that almost all mammals have this gene, regardless of their ability to use echolocation or live in a marine environment. In fact, a search for “Prestin” at Homologene shows that all eukaryotes have this gene, down to C. elegans. I guess Ewert thinks little nematodes are echolocators because they have this module?

I suspect that Ewert is misinterpreting the pseudo-controversy over convergent nonsynonymous mutations in the bat and cetacean Prestin genes, and making a few leaps of logic from that misunderstanding. If Ewert is not talking about the Prestin gene, then I would really like to know what genes bats and dolphins share that are not shared by other mammals.

Reading further, Ewert seems to be arguing for convergent sequence evolution with respect to echolocating and marine mammals, but then abandons that whole argument for gene families. Why? I suspect that even with a few convergent nonsynonymous mutations, the overall evolution of these genes at the DNA level produces the trees predicted by common descent, so it doesn’t fit into the story that Ewert wants to tell.

After reading the paper, I suspect you’re correct.

Yep. It’s in the original Prestin dolphin / bat paper. The Prestin tree fits the prediction of common ancestry once you remove the influence of the (very few) amino acids that were under convergent selection.

If this is what Ewert is basing his thinking on for the “echolocation module” - and I agree, it seems pretty likely that this is the case - then he’s not dealing with evidence that does not support his ideas.

Or, put another way, it’s going to be hard for him to argue that the Prestin gene isn’t part of his proposed “echolocation module.”

So, one of the key genes in one of his modules has all the signs of common ancestry with convergent evolution layered over top. How does that square with his hypothesis that modules are designed apart from common ancestry?

It’s these sorts of things that skeptical, appropriately critical, peer review would address.

Population genetics and phylogenetics aren’t my strong suit, so what I say isn’t gospel (pun intended) . . .

There are also assumptions that are used in these models. For example, it is assumed that a difference at a specific base is due to one mutation. It is also assumed that if two species have the same base at the same position then no mutations have happened. This is called parsimony. It is entirely possible that the same mutation happened independently in two lineages, resulting in the same base at the same position. It is also possible that two mutations have happened at the same base in one lineage instead of just one mutation. As evolutionary distance increases, the chances of these occurrences increase as well. I believe some models take these possibilities into account, but I could be wrong. A search for “DNA homoplasy” returns 27,000+ papers, so this is a well known source of noise in phylogenies.
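One standard example of a model that does allow for multiple mutations at the same site is the Jukes-Cantor (1969) distance correction. The sketch below is offered only as an illustration of the idea; it assumes equal base frequencies and equal substitution rates, which real analyses usually relax, and the function name is my own.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor estimate of substitutions per site.

    p is the observed fraction of sites that differ between two sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

for p in (0.01, 0.10, 0.30, 0.60):
    print(f"observed={p:.2f}  corrected={jc69_distance(p):.3f}")
```

The corrected distance is always at least as large as the observed one, and the gap widens with divergence, which matches the point that hidden and repeated changes become more common as evolutionary distance increases.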

You also have incomplete lineage sorting, which is analogous to your sports betting scenario. When heterozygosity (i.e. a mix of alleles for a given gene) is inherited by two lineages after a speciation event there will be winners and losers as those alleles compete with one another. Sometimes the game goes on for a long time and the heterozygosity is kept. Sometimes the same allele wins in both lineages. Sometimes different alleles win in each lineage. This can cause two more distantly related species to share an allele not shared by the more closely related species. It is a known source of noise, and from what I have read it can be modeled if you know something about the populations and the approximate time for each speciation event.
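For the simplest case of three species, the expected amount of discordance from incomplete lineage sorting has a closed form under the standard neutral coalescent: a gene tree disagrees with the species tree with probability (2/3)·exp(-T), where T is the length of the internal branch in coalescent units (generations divided by twice the effective population size, for diploids). The sketch below just evaluates that formula; the population size and branch lengths are illustrative assumptions.

```python
import math

def discordance_probability(generations, effective_pop_size):
    """Probability that a gene tree disagrees with a three-species tree.

    generations: length of the internal branch between the two speciation events.
    effective_pop_size: diploid effective population size on that branch.
    """
    t = generations / (2.0 * effective_pop_size)  # branch length in coalescent units
    return (2.0 / 3.0) * math.exp(-t)

for gens in (0, 10_000, 50_000, 200_000):
    prob = discordance_probability(gens, effective_pop_size=10_000)
    print(f"internal branch={gens:>7} generations  P(discordant)={prob:.3f}")
```

Short intervals between speciation events or large populations push the discordance toward its maximum of two thirds, while long intervals drive it toward zero, so some conflicting gene trees are an expected output of common descent rather than an anomaly.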

In the end, there are known sources of noise that will accompany the phylogenetic signal. It is expected. It is no different than any other scientific pursuit. I don’t know of any scientific endeavour where there is an r^2 of 1.0000000 for every single data set, and if there were perfect correlations it would make me very skeptical of the reported results. It is strange that Cornelius Hunter cites noise as a reason to doubt the results because most scientists, in my opinion, cast much more doubt on data that has no noise.

No, this is a bizarre assertion. Ewert’s model is not randomly distributed characters. Either you really don’t understand the math, or something.