Nested Clades, The Consistency Index, and Affirming the Consequent

Perfect trees always have a consistency index (CI) of 1. A CI less than 1 means a non-tree DAG fits the data better than a tree. If you look at the Klassen graph, almost all the studies have a CI below 1, and many are far below 1, especially as the number of taxa goes up. If you subtract CI from 1 you get another metric called the homoplasy index (HI), which is a measure of how untreelike the data are, i.e. how often the same character state shows up in completely different clades.

The homoplasy index (HI) is simply 1 − CI.
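
For reference, the usual definition is CI = m/s, where m is the fewest steps the characters could need on any tree and s is the number of steps they actually need on the tree being scored. Here is a minimal Python sketch of that arithmetic. The step counts are made up purely for illustration, and the function names are mine rather than anything from PAUP or another package.

```python
def consistency_index(min_steps, observed_steps):
    """Ensemble CI: total minimum steps divided by total observed steps."""
    return sum(min_steps) / sum(observed_steps)


def homoplasy_index(ci):
    """HI is the complement of CI."""
    return 1.0 - ci


# Hypothetical three-character example: the second and third characters
# each need one extra step on the tree, so the fit is not perfect.
m = [1, 1, 2]   # fewest steps each character could need on any tree
s = [1, 2, 3]   # steps each character needs on the tree being scored

ci = consistency_index(m, s)   # 4 / 6
hi = homoplasy_index(ci)       # 1 - 4 / 6
print(f"CI = {ci:.3f}, HI = {hi:.3f}")
```

When every character changes only as often as it must, m equals s and CI comes out to exactly 1.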

So, if we flip the Klassen 1991 graph on its head, we get a graph of HI vs taxa count. As you can see, the results actually show that the more data we get, the better it is fit by an untreelike DAG, i.e. much closer to what we see in design scenarios. In fact, the ‘phylogenetic signal’ would be more aptly named the ‘design signal’. This is also why I am finding it so easy to generate ‘high’ CI scores from graph structures that don’t look anything like an evolutionary tree.

A tree with no convergence would have a consistency index of 1. How you calculate CI will determine whether or not you get a high consistency index with unresolved branches - how are you handling the lack of resolution? Basically, it’s an apples and oranges problem - a CI value for a tree that is not completely bifurcating is not measuring the same thing as a CI value for a tree that is completely bifurcating, just as comparing tree statistics generated from different data sets is not generally very meaningful.

What version of PAUP are you using? In the current PAUP* 4.0a168 (beta testing), if you scroll down the analysis menu, bootstrap/jackknife gives you ways to test the strength of groupings in the tree through different resampling strategies. Note that likelihood analyses are not fast to begin with, and the bootstrap or jackknife involves repeated analyses, so you need to allow a fair amount of time for it to give reasonable numbers. If you want to try parsimony analyses, TNT analyzes the data faster but is not as user-friendly - you’d want to build your file in PAUP and then analyze it with TNT.
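
To make the resampling idea concrete, here is a rough Python sketch of what one bootstrap or jackknife replicate looks like at the data level. It only illustrates the general technique and does not reproduce how PAUP* implements it. The toy alignment, the function names, and the `keep_fraction` parameter are all my own assumptions.

```python
import random


def bootstrap_replicate(matrix, rng):
    """Resample character columns with replacement, keeping nchar unchanged."""
    nchar = len(next(iter(matrix.values())))
    picks = [rng.randrange(nchar) for _ in range(nchar)]
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in matrix.items()}


def jackknife_replicate(matrix, rng, keep_fraction=0.5):
    """Keep a random subset of columns, sampled without replacement.

    keep_fraction is an illustrative parameter, not a PAUP* default.
    """
    nchar = len(next(iter(matrix.values())))
    n_keep = max(1, round(nchar * keep_fraction))
    picks = sorted(rng.sample(range(nchar), n_keep))
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in matrix.items()}


# Hypothetical toy alignment, only to show the shape of the data.
toy = {"taxonA": "ACGTAC", "taxonB": "ACGTTC", "taxonC": "AGGTTC"}
rng = random.Random(0)
boot = bootstrap_replicate(toy, rng)
jack = jackknife_replicate(toy, rng)
```

Each replicate then gets its own tree search, and the support for a grouping is the fraction of replicates in which that grouping shows up. That repetition is why the run time adds up.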

Note that in PAUP* 4, with your data set of
#NEXUS
Begin data;
Dimensions ntax=4 nchar=2;
Format datatype=DNA gap=- missing=X;
Matrix
taxon1 GA
taxon2 AT
taxon3 TC
taxon4 CG;
End;

a branch and bound search gives three trees with a CI of 1 (describe trees), but it also lists a CI excluding uninformative characters, and that CI is zero. The CI excluding uninformative characters is the meaningful one for the type of question you are asking. Any data set in which none of the taxa actually share any features will automatically give a CI of 1, but none of those features tell a phylogenetic analysis anything meaningful - there need to be some commonalities to work with.
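
If you want to check those two numbers by hand, here is a small Python sketch that scores the same four-taxon matrix on all three possible topologies using Fitch parsimony (unordered characters). It is my own illustration, not PAUP*'s code. It ignores gaps and missing data, and it returns 0 when no informative characters are left only to mirror the value PAUP* lists.

```python
from collections import Counter

# The four-taxon, two-character matrix from the NEXUS block above.
MATRIX = {"taxon1": "GA", "taxon2": "AT", "taxon3": "TC", "taxon4": "CG"}

# The three unrooted topologies for four taxa, rooted arbitrarily for Fitch.
TREES = [
    (("taxon1", "taxon2"), ("taxon3", "taxon4")),
    (("taxon1", "taxon3"), ("taxon2", "taxon4")),
    (("taxon1", "taxon4"), ("taxon2", "taxon3")),
]


def fitch(node, states):
    """Return (state set, step count) for one character on a binary tree."""
    if isinstance(node, str):
        return {states[node]}, 0
    (lset, lsteps), (rset, rsteps) = fitch(node[0], states), fitch(node[1], states)
    common = lset & rset
    if common:
        return common, lsteps + rsteps
    return lset | rset, lsteps + rsteps + 1


def is_informative(column):
    """Parsimony-informative: at least two states, each in at least two taxa."""
    return sum(1 for n in Counter(column).values() if n >= 2) >= 2


def consistency_index(tree, matrix, informative_only=False):
    """Ensemble CI = sum of minimum steps / sum of observed steps."""
    taxa = list(matrix)
    m_total = s_total = 0
    for i in range(len(matrix[taxa[0]])):
        states = {t: matrix[t][i] for t in taxa}
        column = list(states.values())
        if informative_only and not is_informative(column):
            continue
        m_total += len(set(column)) - 1      # fewest possible steps
        s_total += fitch(tree, states)[1]    # steps on this tree
    if s_total == 0:
        return 0.0  # 0/0 is undefined; mirror the zero that PAUP* lists
    return m_total / s_total


for tree in TREES:
    print(consistency_index(tree, MATRIX),
          consistency_index(tree, MATRIX, informative_only=True))
```

Every state in this matrix occurs in exactly one taxon, so each character needs three steps on any tree. That is why all three topologies tie at CI = 1, and why nothing is left once the uninformative characters are dropped.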

An appropriate comparison would be to analyze a data set made of several long random strings of characters (for convenience, say A, G, T, and C) and see how the results for that data set compare with those for a set of actual DNA sequences that are appropriate for a particular group of organisms. (By appropriate, I mean that it needs to fall somewhere between having essentially no change, which again would give a high CI but be highly uninformative, and having so much change that it is essentially randomized. As DNA has only A, G, T, and C to choose from, a sequence that mutates enough will have random matches with other sequences.)
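
One way to build the random half of that comparison is sketched below in Python. It is only a sketch: the taxon names, the sizes, and the seed are arbitrary choices of mine, and the Format line simply copies the one in the data set quoted earlier. The output is a NEXUS data block you could save to a file and run through the same search as the real sequences.

```python
import random


def random_dna_matrix(ntax, nchar, seed=None):
    """Build ntax independent random sequences of length nchar over A, G, T, C."""
    rng = random.Random(seed)
    return {
        f"taxon{i + 1}": "".join(rng.choice("AGTC") for _ in range(nchar))
        for i in range(ntax)
    }


def to_nexus(matrix):
    """Format a {taxon: sequence} dict as a NEXUS data block."""
    ntax = len(matrix)
    nchar = len(next(iter(matrix.values())))
    lines = [
        "#NEXUS",
        "Begin data;",
        f"Dimensions ntax={ntax} nchar={nchar};",
        "Format datatype=DNA gap=- missing=X;",
        "Matrix",
    ]
    lines += [f"{name} {seq}" for name, seq in matrix.items()]
    lines += [";", "End;"]
    return "\n".join(lines)


# Arbitrary illustrative sizes: 8 taxa, 500 sites, fixed seed for repeatability.
random_matrix = random_dna_matrix(ntax=8, nchar=500, seed=1)
nexus_text = to_nexus(random_matrix)
```

Fixing the seed makes the random matrix reproducible, so the same "null" data set can be rerun later or shared with someone else.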

The actual patterns observed when analyzing data for real organisms closely match the expectations of an evolutionary pattern, which is to have pretty good nested clades but also some convergence, random variation, and other “noise” (from the point of view of someone trying to figure out the evolutionary relationships, those are noise, but not necessarily to someone asking other types of questions).

As previously pointed out, the paper by Farris available at https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.1096-0031.1989.tb00573.x discusses the limits of the consistency index and why some modifications can be advantageous in cases where the consistency index alone is not very informative. For data from real organisms, a high consistency index suggests that the analysis is doing a reasonable job of matching the actual evolutionary pattern. However, there are likely to be a huge number of trees that differ very little in CI. A data set with very few potential synapomorphies will not give a very meaningful CI. For example, the sample data sets largely had each taxon unique, with no shared similarities. A data set where almost all taxa had nearly identical sequences would also not give a very meaningful CI.

The CI would, however, distinguish between a set of features that originated by an evolutionary pattern and one that originated from a “mix and match” approach. In principle, an intelligent designer could give bats and birds matching genes for wings, bats and cats matching genes for reproduction, and so on, so that the similarities do not follow any consistent pattern. Such a scenario would yield a low CI. Of course, a design hypothesis cannot be tested unless it is specified how that design took place; a designer would not have to follow such an approach, but it is similar to a popular type of ID model. Conversely, bacteria can pick up all sorts of random DNA, so there is a good deal of mixing and matching without intelligent intervention.
