As a quick introduction, my field of study is computational biology, with particular expertise in applying information theory to biology. My PhD is in “Information and Computer Science”, with emphasis (in my case) on information. I completed the PhD in 2007 and have been working here full time since 2009.
One of the most surprising things for me over the last several years is the shift in anti-evolution arguments towards “information theory”. This is where great confusion abounds, so I thought I would give people a quick map of the situation.
To kick this off, I think this particular conversation is instructive…
Continuing the discussion from What is the Evidence for Evolution?:
I find this to be a very entertaining and instructive exchange. First off, information actually is exactly entropy. If we correctly compute the information, we have exactly computed the entropy; if we exactly compute the entropy, we have computed the information. This is the most important and fundamental result of Information Theory, laid out in the classic paper by Shannon: http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf.
So ENTROPY = INFORMATION.
Of course, it was in Information Theory, not Thermodynamics, that this result was first established. So it is no surprise that our friend here with a 1975 thermodynamics PhD is confused. It turns out that the 2nd law of thermodynamics actually does (nearly) guarantee that information increases with time. I can explain the intuition behind this to the curious, but it makes complete sense once you understand what information is.
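To make the equivalence concrete, here is a minimal sketch in Python (the coin distributions are just illustrative) computing the Shannon entropy of a discrete distribution, which is exactly its average information content in bits:

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over the nonzero p."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A fair coin carries 1 bit per flip; a heavily biased coin carries far less.
print(shannon_entropy([0.5, 0.5]))    # 1.0 bit
print(shannon_entropy([0.99, 0.01]))  # ~0.08 bits
```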
Now the very interesting thing about information (and entropy) is that it has historically been derived in several different ways.
- Thermodynamics (the first derivation of entropy)
- Shannon Information (usually computed assuming IID)
- Kolmogorov Complexity
- Minimum Description Length
- Compression (e.g. integer coding)
- Machine Learning/Model Building (a type of lossy compression)
- Auto-Encoding/Dimensionality Reduction/Embeddings
- Dynamical systems and Equation fitting
NOTE TO THE CONFUSED: This discussion can get confusing because of the many definitions of “information” in common speech, and also because there are two main types of information that work in different ways: (1) entropy, or information content, and (2) mutual information, or shared information. Information content is the amount of information in a single entity (and is measured as the entropy). Mutual information is the amount of information shared by multiple entities (it is a measure of commonality, and is equal to the difference between two entropies). When communicating with the public, it is hard to keep these two types of information straight without devolving into dense technical language. So if you detect a contradiction in what I wrote, this is probably the reason: these two types of information behave similarly in some cases, and exactly opposite in others.
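For those who want the precise statements, these are the two standard textbook quantities (nothing here is specific to ID or to the conversation above): entropy for information content, and mutual information written as a difference of two entropies.

```latex
% Information content (entropy) of a single variable X:
H(X) = -\sum_{x} p(x) \log_2 p(x)

% Mutual information, the information shared between X and Y,
% written as a difference of two entropies:
I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
```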
One of the really interesting things that happened over the last several decades is the discovery that all these different types of information (or approaches) are computing the same thing and relying on the same theory of information. Of course, they often compute information in different ways, and have different strengths, weaknesses, and assumptions. But we can demonstrate that all these approaches actually map to one another.
It turns out that all the types of information discussed in ID can be reduced to either “entropy” or “mutual information”. It also turns out that all the theories above have analogues of these two measures (under specific assumptions). This coherence is one of the most beautiful things about information theory. If you really understand one domain, and you learn how to map between domains, you immediately understand something salient about all of them.
A great example of this is the theoretical result (excuse the technical language) that, approximately, the minimum compression size of a message = the minimum description length = the IID Shannon entropy of the compressed message; moreover, the correct compression will produce a string of bits that is (1) uncorrelated and (2) 50% 1s and 50% 0s, and therefore indistinguishable from random. Another surprising result is that the entity with the highest information content is noise, exactly the opposite of our intuition of what “information” actually is.
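Here is a minimal sketch of the compression result, using Python's built-in zlib (the text is a made-up repetitive example, and zlib is not an optimal compressor, so the numbers are only illustrative): the compressed stream sits near 8 bits per byte, roughly half its bits are 1s, and it will not compress further.

```python
import math
import zlib
from collections import Counter

def iid_entropy_bits_per_byte(data):
    """IID (byte-frequency) Shannon entropy estimate, in bits per byte."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# Made-up structured text: highly patterned, so it compresses well.
text = "".join(f"line {i}: the quick brown fox jumps over the lazy dog\n"
               for i in range(5000)).encode()
compressed = zlib.compress(text, 9)

print("original:  ", len(text), "bytes,",
      round(iid_entropy_bits_per_byte(text), 2), "bits/byte")
print("compressed:", len(compressed), "bytes,",
      round(iid_entropy_bits_per_byte(compressed), 2), "bits/byte")

# The compressed stream is close to indistinguishable from random: about
# half of its bits are 1s, and compressing it again gains essentially nothing.
ones = sum(bin(b).count("1") for b in compressed)
print("fraction of 1 bits:", round(ones / (8 * len(compressed)), 2))
print("compressed twice:  ", len(zlib.compress(compressed, 9)), "bytes")
```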
Another very important result is that the true entropy of a sequence (its true information content, independent of assumptions like IID) is uncomputable. The proof is esoteric, but the intuitive explanation is that we can never be sure that we did not miss a hidden pattern in the data that would allow us to “compress” or “understand” it better. We can never be sure that what looks like noise to us actually is noise. Equivalently, we can never be sure that what looks like information to us actually is information.
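A small sketch of that intuition (using a seeded pseudorandom generator as a stand-in for a hidden pattern): the bytes below look like incompressible noise to a general-purpose compressor, yet their true description is tiny, essentially just the generator and its seed.

```python
import random
import zlib

# Bytes that look like noise but are fully determined by a short program:
# a stand-in for a "hidden pattern" that a generic compressor cannot see.
rng = random.Random(42)            # the entire "secret" is this one seed
data = bytes(rng.randrange(256) for _ in range(100_000))

print("apparent size:  ", len(data), "bytes")
print("zlib compressed:", len(zlib.compress(data, 9)), "bytes (no real gain)")
# The honest description ("random.Random(42), 100,000 bytes") is a few dozen
# characters, but no general-purpose algorithm can discover that.
```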
One might wonder, if all these theories map to one another, why we keep them all. Each theory comes with its own assumptions, its own ability (or lack thereof) to represent specific types of data, and its own ultimate goals. Moreover, the uncomputability result pragmatically guarantees that there is no single solution that will work in all domains. So, even though these are all connected by a common information theory, there is reason to maintain and develop each one. There is even value in cross-pollinating between them, as has happened several times in the history of these fields.
And now the ID movement has gone on to define a few more terms, which turn out to map exactly onto prior concepts.
- Functional Sequence Complexity http://www.biorxiv.org/content/early/2017/03/06/114132
- Algorithmic Specified Complexity http://robertmarks.org/REPRINTS/2014_AlgorithmicSpecifiedComplexity.pdf
FSC, for example, maps directly to “Conditional Mutual Information”; it is exactly the same formula. ASC is just mutual information, with (usually) a uniform distribution, computed with Kolmogorov complexity, but computed in a way that is guaranteed to overestimate information content (the proof here is instructive). A detailed discussion of these is probably beyond the scope of this post, but there are clear, demonstrable mathematical errors in both the calculation and interpretation of these quantities that become obvious once they are mapped back to standard information theory.
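For reference, this is (roughly, from memory of the linked paper) the shape of the ASC formula in standard notation; the first term is a Shannon-style surprisal under a chosen distribution P (usually uniform), and the second is a conditional Kolmogorov complexity given a context C:

```latex
\mathrm{ASC}(x, C, P) \;=\; -\log_2 P(x) \;-\; K(x \mid C)
```

Written this way, it is visibly a Shannon term minus a Kolmogorov term, which is exactly the “mash up” discussed further below.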
Usually, these errors are made:
- It is mentioned, then forgotten, that random noise has the highest information content of anything there is. It is impossible to compress, and it always inflates measures of information. This is the insight of machine learning: we look for “lossy compressions” of the data that can reproduce its patterns minus the noise. This denoising activity is one way to frame, for example, summarizing data with a fit line (see the noise sketch after this list). So, for example, in FSC it is forgotten that correlated noise (which creates a signature of common descent) dramatically increases FSC calculations. The same is true for ASC.
- It is forgotten that “semantic” information (which is very poorly defined) is characterized by very low total information content (though it is very dense). Equivalently, semantic information is only a tiny fraction of the total information in a piece of data. In this way, semantic information is like a very useful and very lossy compression. For example, I can mention “Mount Rushmore” and you will picture the carved presidents on a cliff, demonstrating that just a few bits can convey the semantic content of a picture of Mount Rushmore that might be impossible to compress to less than several MB without losing “information”. So if we compute the information in some context and it is high, that is almost a guarantee that what we are measuring is not actually semantic at all. The most likely and frequent reason for high information measurements is noise. The most likely and frequent reason for high mutual information measurements is correlated (i.e. shared) noise. In contrast, semantic information is almost defined (except that it has no precise definition) as a useful but dramatically lossy compression of the data. So semantic information is dense and useful, but it also has low total information content (because there is not much of it), and it throws away most of the information in the data.
- Usually, when considering biology, these algorithms start with the assumption that information should be measured against a uniform distribution. This is a very strange assumption, because (as far as I know) there is almost no system in all of nature that is perfectly uniform (quantum noise might be the only example). There are always patterns in the data; these patterns are the signature of mechanisms, and each of these patterns drives the information content of the system (with respect to a uniform distribution) upwards (see the uniform-distribution sketch after this list). Of course, in ID, mechanism is never considered as an option (even though mechanism increases measures like ASC and FSC, because they are analogues of mutual information). Instead, any mutual information is asserted to be the “unique product of a mind.”
- It is assumed that information content can be reliably computed. It cannot be; remember, it is “uncomputable”. A great example is the pervasive and false claim that DNA is incompressible. It turns out that it is compressible, but you need a special algorithm (zip and bzip will not do) to compress it efficiently. Ironically, this better compression algorithm is based on evolutionary theory. Not knowing this, one would assume that DNA has much higher functional/semantic information content than it actually does.
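On the noise point, here is a minimal sketch (with made-up data and arbitrary parameters) of the machine-learning framing: a fit line is a tiny lossy compression of noisy data, while the leftover noise carries most of the raw “information” and refuses to compress.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1000, dtype=float)
y = 3.0 * x + 7.0 + rng.normal(scale=5.0, size=x.size)   # signal plus noise

slope, intercept = np.polyfit(x, y, 1)    # the lossy "compression": 2 numbers
trend = slope * x + intercept
residuals = y - trend                     # what the model throws away: noise

def to_bytes(a):
    """Quantize an array to bytes so a compressor can look for structure."""
    a = a - a.min()
    return np.round(255 * a / a.max()).astype(np.uint8).tobytes()

print("model itself:          2 parameters (slope, intercept)")
print("trend, compressed:    ", len(zlib.compress(to_bytes(trend), 9)), "bytes")
print("residuals, compressed:", len(zlib.compress(to_bytes(residuals), 9)), "bytes")
# The smooth trend compresses far better than the residual noise, even though
# the noise is the part of the data that carries no "meaning" at all.
```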
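On the uniform-distribution point, a small sketch (the sentence is a made-up example) of how much the background distribution matters: measured against a uniform distribution over its alphabet, the “information” in a text is inflated above what even its plain letter frequencies justify, before any deeper patterns are modeled at all.

```python
import math
from collections import Counter

text = "the presence of patterns is the signature of mechanism"
alphabet = sorted(set(text))

# "Information" of the text measured against a uniform background.
uniform_bits = len(text) * math.log2(len(alphabet))

# Information measured against the empirical letter frequencies.
counts = Counter(text)
n = len(text)
empirical_bits = -sum(c * math.log2(c / n) for c in counts.values())

print("alphabet size:", len(alphabet), "symbols")
print("against a uniform distribution:", round(uniform_bits, 1), "bits")
print("against the actual frequencies:", round(empirical_bits, 1), "bits")
# Every pattern we refuse to model pushes the uniform-background number up,
# long before we account for words, grammar, or (in biology) mechanism.
```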
The fact that they are renaming and rederiving old ideas (like mutual information and conditional entropy) makes all this less obvious. One has to know information theory very well to recognize the formulas. None of the formulas they present are actually new. They are just relabelings of old formulas, and the new names make it harder to connect them to all the prior work done in this area. That is helpful to their case, because it becomes less obvious how the formulas are misapplied and where the errors arise.
One example of an egregious error is the claim that ASC is a lower bound, and therefore underestimates information content: http://robertmarks.org/REPRINTS/2014_AlgorithmicSpecifiedComplexity.pdf This claim is false, but it is hard to see why, because ASC is a “mash up” formula defined as Shannon entropy (which assumes a uniform Markov process) minus Kolmogorov complexity (which is an upper bound). This sort of mixing of theories can be okay, but by not carefully tracking the underlying assumptions on each “side of the minus”, Marks incorrectly concludes that ASC is a lower bound. Ironically, the paper by Marks that I linked to actually produces an example (section 3.2) demonstrating that his claim that ASC is a lower bound is false, and that the claim that it identifies intelligence is false too.
Anyhow, I hope this is understandable and interesting to people. I am curious to hear people’s thoughts and questions.