Signal vs. Noise, Part 2: Hunter Opens the Klassen Study Again

Well, think of it this way. If you look at the sky, you’ll see that everything goes around the Earth. It is an undeniable pattern that must be explained. You could say the cosmos exhibits an approximate geocentric pattern. Copernicus published his heliocentrism, which had many similarities to geocentrism: you move the Sun to (roughly) the center, but otherwise everything still travels in circles. The so-called “Copernican Revolution,” as it has been constructed, is largely a myth. The supposed cultural shift caused by moving Earth from the center is highly exaggerated. And Copernicus’ model wasn’t very good either. But it was an important moment in model improvement.

So regarding Ewert’s paper and that quote, it certainly is true that at a glance the species can give the impression of a nested hierarchy. That was the idea from Aristotle to Linnaeus, so there is something there. The nested hierarchy model is simpler and easier to conceptualize than the DG model, and the latter can be fitted into the former. That is, if you simulated a DG process, constructed a biological world of genes and species, and then fed the data into a phylogenetic tree-building algorithm, it would return a reasonable tree. Many additional mechanisms would be needed (homoplasies, divergences, etc.), but it would work. Simply put, a DG can be shoehorned into a CD model. This is what the paper was getting at. We can say that the species exhibit an approximate nested hierarchy, with the understanding that “approximate” here can mean a pretty lousy model, just as geocentrism is pretty lousy.

That is strikingly bold rhetoric, Dr. Hunter, considering that your “heliocentrism” can make almost no predictions about the orbits of planets, moons, asteroids, etc.

Here’s another metaphor that is apt for our discussion:

In the land of the blind, the one-eyed man is king.

The theory of evolution may have somewhat blurry vision, but it’s the only theory in the hypothesis space that is making any predictions with respect to the vast body of evidence.

Grace and peace,
Chris

The DG model has not been fitted to sequence data, as even Ewert explained. There is currently no DG explanation for the phylogenetic signal in sequence data. As noted by @glipsnort, the DG model can’t even predict the pattern of sequence differences between species with respect to transitions, transversions, and CpG mutations. The DG model also can’t explain orthologous ERVs, genetic equidistance, or the divergence of introns and exons. The common descent model can explain all of these pieces of sequence data; the DG model cannot.
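As an aside, the transition/transversion pattern is easy to make concrete. Below is a toy sketch (the sequences are made up; real analyses use genome-scale alignments) of how differences between two aligned sequences get classified:

```python
# Classify substitutions between two aligned sequences as transitions
# (purine <-> purine or pyrimidine <-> pyrimidine) or transversions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ts_tv_counts(seq1: str, seq2: str) -> tuple[int, int]:
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue  # identical site, no substitution
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv

print(ts_tv_counts("ACGTACGT", "GCGTATGA"))  # (2, 1) for these toy strings
```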

Klassen’s statement is not elegantly phrased. Notwithstanding that, I submit with all due respect that you have misread Klassen here.

He does not say that you cannot use the 49 data sets to reject the null hypothesis.

Instead, he is saying that rejecting the null hypothesis does not necessarily imply being able to identify with confidence the single cladogram, within the permutation space of all possible cladograms, that best fits a particular data set. Single best-fit cladograms, one per data set, are the “phylogenetic conclusions” to which Klassen is referring.

This is a common problem in conducting a search over an NP-hard problem domain. You can use some sort of hill-climbing to reach a local maximum, but once you get there, can you be 100% confident it is the global maximum? This is the problem Klassen is dealing with. He is not disputing the ability to identify the existence of a mountain range (a phylogeny) as opposed to the prairie land of randomness. Nor is he disputing the ability to find the highest peak within the portion of the range that has been searched. He is simply pointing out that the methods that existed in 1991, and perhaps even the methods that exist today, are not able to claim with 100% confidence that the best-fitting phylogeny that emerges from a necessarily constrained analysis is the best-fitting phylogeny for a domain under study.
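To make the local-vs.-global distinction concrete, here is a toy sketch in Python. The one-dimensional landscape stands in for the space of cladograms; it illustrates the search problem, not Klassen’s actual procedure:

```python
import random

def hill_climb(score, neighbors, start):
    """Greedy ascent: move to the best strictly-better neighbor until none exists."""
    current = start
    while True:
        better = [n for n in neighbors(current) if score(n) > score(current)]
        if not better:
            return current  # a local maximum, not necessarily the global one
        current = max(better, key=score)

# Toy landscape on 0..99 with a local peak at 20 and the global peak at 80.
def score(x):
    return max(10 - abs(x - 20), 20 - 0.5 * abs(x - 80))

def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 99]

random.seed(0)
peaks = {hill_climb(score, neighbors, random.randrange(100)) for _ in range(20)}
print(peaks)  # typically {20, 80}: different starting points reach different peaks
```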

Klassen refers to two other publications to support his conclusion:

  • FAITH, D. P., AND P. S. CRANSTON. 1991. Could a cladogram this short have arisen by chance? On permutation tests for cladistic structure. Cladistics 7:1-28.
  • FARRIS, J. S. 1972. Estimating phylogenetic trees from distance matrices. Am. Nat. 106:645-668.

Due to paywall constraints, I have only been able to read the abstracts of these publications. The abstracts very clearly deal with the frustrations of finding a best-fit cladogram, rather than with the question of whether the null hypothesis of randomness is overcome. The ability to reject the null hypothesis is not questioned in the least by the publications that Klassen refers to, at least as far as I can tell from the abstracts.

Thanks, and have a great southern California day.

Chris

You can’t use DNA (or protein) sequences. That’s the problem. By using sequence data, you are prefiltering the data set down to only those sequences that are present in all the species being compared. Hence it isn’t accurate; it is a self-fulfilling prophecy. If you prefilter to only have data that support your theory, then of course you’ll get very nicely behaved data. But if you are interested in realism, then you will look at the preponderance of the evidence. (I’m repeating myself from another thread, but you brought up this subject.)

Hello Dr. Hunter,

I don’t understand this assertion. My understanding is that research studies use homologous sequences so that they can measure Levenshtein distances. Shorter distances are correlated with closer relationships, longer distances with more distant relationships. If you don’t use homologous sequences, however, then you have no way to measure distance.
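For concreteness, here is a minimal sketch of the Levenshtein metric (the sequences are hypothetical, and real pipelines typically use alignment-based substitution models rather than raw edit distance):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Two made-up homologous sequences differ by only two edits:
print(levenshtein("GATTACA", "GACTATA"))  # 2
```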

As long as a sufficiently long sequence (or multiple sequences) is used, however, the null hypothesis is that the trees emerging from analyses of different segments should be no more similar than chance would produce. If there is no historical basis of common descent, a tree m produced by happenstance from one segment is highly unlikely to be similar to a tree n from another segment.

If all of the segments yield the same or very similar trees, however, then the concordance can become statistically significant evidence of a history of common descent.

For that matter, even if a single study yields a false positive due to a tiny sample size, other studies of different sequences for the same taxa would be quite unlikely to yield a similar tree if there were no historical basis in common descent.
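To put a rough number on “quite unlikely”: the space of possible trees explodes with the number of taxa, so agreement between independently derived trees is extraordinarily improbable under the null hypothesis. A back-of-the-envelope sketch (taxon counts chosen for illustration):

```python
# The number of distinct unrooted binary trees on n taxa is
# (2n-5)!! = 3 * 5 * 7 * ... * (2n-5), so the chance that a second
# randomly drawn tree matches the first is 1 / (2n-5)!!.
def num_unrooted_trees(n: int) -> int:
    """Count unrooted, binary, leaf-labeled trees on n taxa (n >= 3)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # the odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (10, 20, 30):
    trees = num_unrooted_trees(n)
    print(f"{n} taxa: {trees:.3e} possible trees; "
          f"P(match by chance) ~ {1 / trees:.1e}")
```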

Now if you could provide evidence that all of the phylogenetic studies are based on single, very short sequences, then I could understand your concern. Such an approach could indeed produce false phylogenetic positives. However, I don’t think that you will be able to provide such evidence; all of the phylogenetic papers I have read from the past few years (admittedly, not that many) use multiple, long sequences in their analysis.

In addition to any citations you could provide, Dr. Hunter, I would also welcome input from others who are well-read in the literature, such as @DennisVenema, @sfmatheson, @T_aquaticus, and @glipsnort.

Thanks,
Chris Falter

While you are cogitating on this issue, Dr. Hunter, perhaps you could help us by describing the predictions a design model would make with regard to patterns (if any) in the Levenshtein distance of homologous DNA sequences in multiple taxa.

Thanks,
Chris Falter

Evolutionists use homologous characters so they can measure distance.

Yes, precisely. That’s what I mean by “prefiltering.” An enormous wealth of data are filtered out. The methods themselves are theory-laden.

Let’s try again, Dr. Hunter.

If the design model cannot make any predictions regarding patterns (or lack thereof) in these studies of homologous sequences, then how are we supposed to do model selection in a scientific way?

Yes, every scientific research project is informed by theory. So I am not exactly sure what you are aiming at here.

I am going to go out on a limb and guess that it is your belief that heterologous sequences prove that evolution is inferior to some other unspecified theory which makes more accurate predictions about the simultaneous presence in a taxonomic analysis of both:

  • nested hierarchies in homologous sequences and
  • no hierarchy in heterologous sequences.

The theory of evolution accomplishes this through stochastic modeling that incorporates factors like incomplete lineage sorting, convergent evolution, copy-and-modification, etc. The theory of evolution actually has a model for the noise. But you think a model is already available which makes more robust, quantitatively accurate, and parsimonious predictions than evolution.

However, according to Winston Ewert, this unspecified theory cannot possibly be dependency graph analysis, because DG makes no predictions, as of today and likely for some time into the future, with regard to sequence data.

If I have guessed incorrectly, kindly provide any corrections you think are suitable.

If I have guessed correctly, kindly name the competing theory and tell us what predictions it makes with regard to any patterns in sequence data, and how those predictions are derived.

Thanks,
Chris Falter

Since this thread started with the Klassen 1991 metastudy, it’s worth pointing out that Klassen did not “prefilter” any of the traits or taxa from his Consistency Index analysis. Nevertheless, Klassen demonstrated an extremely strong signal for nested hierarchy. (“Strong” here refers to the statistical significance of the signal.)
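For readers unfamiliar with how such a significance claim is established, here is a toy sketch of the permutation-test logic (a stand-in illustration with made-up characters, not Klassen’s actual CI calculation):

```python
import random

def congruence(cols):
    """Mean fraction of taxa on which each pair of characters agrees."""
    pairs = [(a, b) for i, a in enumerate(cols) for b in cols[i + 1:]]
    return sum(sum(x == y for x, y in zip(a, b)) / len(a)
               for a, b in pairs) / len(pairs)

def permutation_p_value(cols, n_perm=10_000, seed=1):
    """Shuffle each character column to destroy cross-character structure,
    then ask how often chance matches the observed congruence."""
    rng = random.Random(seed)
    observed = congruence(cols)
    hits = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(col, len(col)) for col in cols]
        if congruence(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Three hypothetical binary characters scored across six taxa, largely congruent:
chars = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
print(permutation_p_value(chars))  # small p-value: congruence unlikely by chance
```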

So no, support for nested hierarchy does not disappear in the absence of “prefiltering.”

Best,
Chris Falter

Why is that a problem? It’s kind of hard to sequence DNA that has been deleted in a lineage.

The theory of evolution predicts that phylogenies of sequenced orthologous DNA will recapitulate the phylogenies based on morphology. That is the prediction that is being tested. If the DG model does not make a prediction with respect to the differences and similarities in the sequence of orthologous DNA, then it is an inferior model to the theory of evolution.
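“Recapitulate” can be made quantitative. Here is a minimal sketch (hypothetical taxa and trees) that scores how many clades two rooted trees share:

```python
def clade_overlap(tree_a: set, tree_b: set) -> float:
    """Fraction of clades shared between two rooted trees, where each tree
    is given as a set of frozensets of taxon names (one per clade)."""
    return len(tree_a & tree_b) / len(tree_a | tree_b)

# ((human, chimp), gorilla) from morphology, and the same topology from DNA:
morph = {frozenset({"human", "chimp"}), frozenset({"human", "chimp", "gorilla"})}
dna   = {frozenset({"human", "chimp"}), frozenset({"human", "chimp", "gorilla"})}
print(clade_overlap(morph, dna))  # 1.0: full agreement

# A conflicting tree, ((human, gorilla), chimp), shares only the root clade:
dna_alt = {frozenset({"human", "gorilla"}), frozenset({"human", "chimp", "gorilla"})}
print(clade_overlap(morph, dna_alt))  # ~0.33
```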

HUH???

Why would the simple fact of sharing a DNA sequence guarantee that those shared sequences would recapitulate the phylogenies based on morphology? You need to explain this.

We already know from our own design programs that this isn’t true. We have inserted an exact copy of a jellyfish gene into mice, which produces a DNA phylogeny that is completely different from the morphological phylogenies. Simply sharing DNA does not force a fit between phylogenies based on DNA and morphology.

That’s because the theory of evolution makes predictions of what you will see in orthologous sequence. Therefore, you can use orthologous sequences to test the theory of evolution.

You just answered your own question.

Please see:

If so, then evolution is false by modus tollens.

When a theory generates false predictions, it is not a very good theory.

That study uses orthologous genes, so I’m not sure what you are getting at. How are you supposed to compare genes if a gene is not found in one of the species?

Since you can’t come up with a reason, other than common ancestry and vertical inheritance, why phylogenies based on DNA sequence would recapitulate the phylogenies based on morphology, I don’t see what objection you can have to using orthologous genes.

There is a statistically significant phylogenetic signal, so the predictions have been supported.

Dr. Hunter,

Hope your southern California weekend is going well.

You chose to analogize the theory of evolution to geocentrism. I am going to choose a different analogy that I believe to be more accurate and informative: X-ray crystallography and DNA structure.

Just as evolution predicts a statistically significant nested hierarchy structure in a taxonomy, biochemistry predicts that DNA can take on structural forms known as A-DNA and B-DNA. The test of the hypothesis is the similarity of the predicted X-ray crystallography images to the actual ones. Here I refer to the predicted and actual images from an article on quora.com, “How does one physically interpret the different diffraction patterns between A-DNA and B-DNA?”

Now it would be possible to build a consistency index for the predicted vs. actual similarity. The CI could answer the question: for each pixel that is dark in the actual image, is the corresponding predicted pixel dark? Sum up the number of pixels for which the correspondence is true and divide by the number of dark pixels in the actual image. This approach would be very similar to the CI approach adopted by Klassen et al. in 1991, except that they were analyzing characters instead of pixels.
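In code, that calculation would look something like the following (with made-up arrays standing in for the real images):

```python
import numpy as np

def pixel_ci(predicted: np.ndarray, actual: np.ndarray) -> float:
    """For each dark pixel in the actual image, check whether the
    corresponding predicted pixel is dark; return matches / actual darks."""
    dark_actual = actual.sum()
    matches = np.logical_and(predicted, actual).sum()
    return matches / dark_actual

# Hypothetical boolean images (True = dark pixel), not the real diffraction data:
rng = np.random.default_rng(42)
predicted = rng.random((100, 100)) < 0.2
actual = np.logical_or(predicted, rng.random((100, 100)) < 0.15)  # plus noise
print(f"CI = {pixel_ci(predicted, actual):.2f}")  # well below 1.0, yet correlated
```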

Without access to the original data, I cannot provide an exact CI for the A-DNA and B-DNA images. However, there are clearly a lot more dark pixels in the actual image than in the predicted. I would guess the CI is roughly 0.5 for A-DNA and roughly 0.25 for B-DNA, which has enormous black blobs at the top and bottom where a thin segment of dots is predicted.

The question is: should the actual data be interpreted as evidence for the predicted structures of A-DNA and B-DNA?

Answer #1 is:

No, the actual pixels are poor evidence for the theory. Certainly there is some similarity between predicted and actual. But just as you have to introduce epicycles into geocentrism to account for planetary orbits, you have to introduce extraneous factors to account for the CI values, which are far below 1.0. To the extent that you consider the theory of DNA structure to be good, it’s only because you are prefiltering the badly predicted pixels. Consequently, we should consider the theory of DNA structure to be not a very good theory.

Answer #2 is:

Yes, the actual pixels are powerful evidence for the theory. The probability of the null hypothesis for the actual images (null hypothesis: random placement of pixels due to no structure) is infinitesimal, something like 0.0000000005. Therefore the alternative hypothesis, A-DNA and B-DNA, should be accepted.

The actual images do contain significant noise, but we have known mechanisms to account for that noise.

It is also possible that some other, as-yet unidentified hypothesis might be even more consistent with the actual images than the A-DNA and B-DNA hypotheses. If that as-yet unidentified hypothesis survives peer review, then we can adopt it. But until that as-yet unidentified hypothesis shows up, we accept the A-DNA and B-DNA theory with a high degree of confidence.

The biochemistry community has adopted Answer #2, not Answer #1. We should do likewise for the theory of evolution and the Klassen data, as well as for the more recent, genomic-based phylogenetic studies.

Best regards,
Chris Falter

Let me try again. This traces back to the question about the Ewert paper not using sequence data, but rather presence/absence data. I explained that the problem with sequence data is that, in order to align and compare sequences, the gene must be present in both species. So by definition, you are filtering out cases where one species has the gene but the other species lacks it. In such cases you have a big difference between two species, but it is not being counted; it is filtered out.

Your response was to say that, well, we need to have the gene present in both species in order to perform a sequence comparison. Yes, agreed, that is true. I am not disagreeing with your point; I am pointing out that you are simply reinforcing the problem I identified. The data are “theory-laden.” This prefiltering removes data comparisons which are highly improbable on the theory. IOW, they do harm to the theory; they lower the probability that your theory is true (speaking in Bayesian terms here, of course).

Well, it reflects the science. I’ll boil down just one aspect of this for you:

If you take computer software as an example (as in the Ewert paper), we know two things about it:

  1. It was designed, naturally falls into a dependency graph, and does not form a nested hierarchy.

  2. It overwhelmingly passes the common descent type of test you evolutionists are talking about.

Hence the caution from the Klassen paper. A dataset can pass the CD test, but be designed, and not be a nested hierarchy. Now, my question for you: What does that tell you about the CD test?

But how does this bias the results towards a nested hierarchy?

How so?

I could also cite a counterexample in the form of PtERV insertions in chimps and gorillas. Hundreds of insertions from this strain of retrovirus are found in the chimp and gorilla genomes, but not in the human genome. Given the accepted phylogeny of humans, chimps, and gorillas, and the absence of PtERV insertions in the human genome, these insertions had to occur after the split between the chimp and human lineages. This also leads to the prediction that PtERV insertions should be found at non-orthologous positions in the chimp and gorilla genomes, since they are independent insertions.

The prediction based on common ancestry is supported. For all of the PtERV insertions in the chimp and gorilla genomes whose positions could be determined at single-base resolution, the insertions are found at different positions. This is a case of common ancestry making predictions about non-orthologous genes and genetic features that aren’t shared, so I’m not sure why you think they are such a problem.
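A back-of-the-envelope calculation shows why coincident positions would be so surprising under independence (the insertion count below is illustrative, and real analyses use orthologous coordinates rather than raw positions):

```python
# Expected number of coincident positions if ~140 insertions occurred
# independently in each of two ~3 Gb genomes.
genome_sites = 3_000_000_000
insertions_each = 140  # illustrative count, not the published figure
expected_collisions = insertions_each * insertions_each / genome_sites
print(f"{expected_collisions:.2e}")  # ~6.5e-06: essentially zero by chance
```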

Can you be more specific? How so … what?

Sure. I am referring to this part of your reply:

“I am pointing out that you are simply reinforcing the problem I identified. The data are ‘theory-laden.’ This prefiltering removes data comparisons which are highly improbable on the theory.”

How are gene deletions “highly improbable”? What exactly is “highly improbable”?

Also, I still don’t understand why using orthologous genes would bias the data towards a nested hierarchy. Can you explain this?
