Information = Entropy

As a quick introduction, my field is computational biology, with specific expertise in applying information theory to biology. My PhD is in “Information and Computer Science”, with an emphasis (in my case) on information. I finished the PhD in 2007 and have been working here full time since 2009.

One of the most surprising things for me over the last several years is the shift in anti-evolution arguments towards “information theory” arguments. Here is where great confusion abounds. And I thought I would give people a quick map of the situation.

To kick this off, I think this particular conversation is instructive…

Continuing the discussion from What is the Evidence for Evolution?:

I find this to be a very entertaining and instructive exchange. First off, information actually is exactly entropy. If we correctly compute information, we have exactly computed the entropy. If we exactly compute entropy, we have computed the information. This is the most important and fundamental result of Information Theory, laid out in the classic paper by Shannon: http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf.
So ENTROPY = INFORMATION.

Of course, it was in Information Theory, not Thermodynamics, that this result was first established. So it is no surprise that our friend here with a 1975 thermodynamics PhD is confused. It also means that the 2nd law of thermodynamics actually does (nearly) guarantee that information increases with time. I can explain the intuition of this to the curious, but it makes 100% sense once you understand what information is.

Now the very interesting thing about information (and entropy) is that it has historically been derived in several different ways:

  1. Thermodynamics (the first derivation of entropy)
  2. Shannon Information (usually computed assuming IID)
  3. Kolmogorov Complexity
  4. Minimum Description Length
  5. Compression (e.g. integer coding)
  6. Machine Learning/Model Building (a type of lossy compression)
  7. Auto-Encoding/Dimensionality Reduction/Embeddings
  8. Dynamical systems and Equation fitting

NOTE TO THE CONFUSED: This discussion can get confusing because of the many definitions of “information” in common speech, and also because there are two main types of information that work in different ways: (1) entropy, or information content, and (2) mutual information, or shared information. Information content is the amount of information in a single entity (and is measured as the entropy). Mutual information is the amount of information shared by multiple entities (it is a measure of commonality, and is equal to the difference between two entropies). When communicating with the public, it is hard to keep these two types of information straight without devolving into dense technical language. But if you detect a contradiction in what I wrote, this is probably the reason: these two types of information behave similarly in some cases, and exactly opposite in others.
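
For those who want to see the two quantities side by side, here is a minimal sketch in Python (the toy sequences are made up for illustration): entropy as the information content of one sequence, and mutual information as a difference of entropies measuring what two sequences share.

```python
from collections import Counter
from math import log2

def entropy(seq):
    """IID Shannon entropy in bits per symbol: H(X) = -sum p(x) log2 p(x)."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """Mutual information as a difference of entropies: I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Toy example (hypothetical data): two short sequences that partly track each other.
x = "ACGTACGTACGTACGT"
y = "ACGTACGAACGTACGA"
print("information content of x:", entropy(x), "bits/symbol")
print("information x and y share:", mutual_information(x, y), "bits/symbol")
```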


One of the really interesting things that happened over the last several decades is the discovery that all these different types of information (or approaches) are all computing the same thing and relying on the same theory of information. Of course, they often compute information in different ways, and have different strengths, weaknesses, and assumptions. But we can demonstrate that all these things actually map to one another.

It turns out that all the types of information discussed in ID can be reduced to either “entropy” or “mutual information”. Turns out all the theories above have analogues to these two measures (under specific assumptions) also. This coherence is one of the most beautiful things of information theory. If you really understand one domain, and you learn how to map between different domains, you immediately understand something salient about all of them.

A great example of this is the theoretical result (excuse the technical language) that (approximately) the minimum compression size of a message = minimum description length = IID Shannon entropy of the compressed message; moreover, correct compression will produce a string of bits that is (1) uncorrelated and (2) 50% 1s and 50% 0s, and therefore indistinguishable from random. Another surprising result is that the highest information content entity is noise, exactly the opposite of our intuition of what “information” actually is.
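
To make the “compressed output is indistinguishable from random” claim tangible, here is a small sketch using zlib as an imperfect stand-in for an ideal compressor (the sample text is made up; exact numbers will vary):

```python
import zlib

# Redundant, made-up text: lots of structure for the compressor to squeeze out.
text = " ".join(f"SAMPLE RECORD {i} HAS VALUE {i * i}" for i in range(2000)).encode()
packed = zlib.compress(text, 9)

def ones_fraction(data):
    bits = "".join(f"{byte:08b}" for byte in data)
    return bits.count("1") / len(bits)

print(f"{len(text)} bytes -> {len(packed)} bytes after compression")
print(f"fraction of 1 bits before: {ones_fraction(text):.3f}, after: {ones_fraction(packed):.3f}")
# The compressed stream sits near 50% ones and refuses to shrink further --
# exactly the statistics of a random string.
print(f"compressing the compressed output again: {len(zlib.compress(packed, 9))} bytes (no further gain)")
```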

Another very important result is that the true entropy of a sequence (its true information content, independent of assumptions like IID) is uncomputable. The proof is esoteric, but the intuitive explanation is that we can never be sure that we did not miss a hidden pattern in the data that would allow us to “compress” or “understand” it better. We can never be sure that what looks like noise to us actually is noise. Equivalently, we can never be sure that what looks like information to us actually is information.


One might wonder, if all these theories map to one another, why we keep them all. Each theory comes with its own assumptions, ability (or lack thereof) to represent specific types of data, and ultimate goals. Moreover, the uncomputability result pragmatically guarantees that no single solution will work in all domains. So, even though these are all connected by a common information theory, there is reason to maintain and develop each one. There is even value in cross-pollinating between them, as has happened several times in the history of these fields.


And now the ID movement has gone on to define a few more terms, which turn out to map exactly to prior concepts.

  1. Functional Sequence Complexity http://www.biorxiv.org/content/early/2017/03/06/114132
  2. Algorithmic Specified Complexity http://robertmarks.org/REPRINTS/2014_AlgorithmicSpecifiedComplexity.pdf

FSC, for example, maps directly to “Conditional Mutual Information”. It is exactly the same formula. ASC is just mutual information against (usually) a uniform distribution, computed with Kolmogorov complexity, but computed in a way that is guaranteed to overestimate information content (the proof here is instructive). A detailed discussion of these is probably beyond the scope here, but there are clear, demonstrable mathematical errors in both the calculation and interpretation of these quantities that become obvious when they are mapped back to standard information theory.

Usually, these errors are made:

  1. It is mentioned then forgotten that random noise is the most potent type of information there is. It is impossible to compress and always inflates measures of information. This is the insight of machine learning: we look for “lossy compressions” of the data that can reproduce its patterns minus the noise. This denoising activity is one way to frame, for example, summarizing data with a fit line. So, for example, in FSC it is forgotten that correlated noise (which creates a signature of common descent) dramatically increases FSC calculations. The same is true for ASC. (A small sketch after this list illustrates the noise point.)

  2. It is forgotten that “semantic” information (which is very poorly defined) is characterized by very low total information content (though it is very dense). Equivalently, semantic information is only a tiny fraction of the total information in a piece of data. In this way, semantic information is like a very useful and very lossy compression. For example, I can mention “Mount Rushmore” and you will picture the carved presidents on a cliff; just a few bits convey the semantic information in a picture of Mount Rushmore that might be impossible to compress to less than several MB without losing “information”. So if we compute information in some context and it is high, that is almost a guarantee that what we are measuring is not actually semantic at all. The most likely and frequent reason for high information measurements is noise. The most likely and frequent reason for high mutual information measurements is correlated (i.e. shared) noise. In contrast, semantic information is almost defined (though it has no precise definition) as a useful, dramatically effective, and lossy compression of data. So semantic information itself is information dense and useful, but it also has low total information content (because there is not much of it), and it throws away most of the information in the data.

  3. Usually when considering biology, these algorithms start with the assumption that information should be measured against a uniform distribution. This is a very strange assumption because (as far as I know) there is almost no system in all of nature that is perfectly uniform (quantum noise might be the only example). There are always patterns in the data; these patterns are the signature of mechanisms, and each of these patterns drives the information content of the system (with respect to a uniform distribution) upwards. Of course in ID, mechanism is never considered as an option (even though it increases measures like ASC and FSC, because they are analogues of mutual information). Instead, any mutual information is asserted to be the “unique product of a mind.”

  4. It is assumed that information content can be reliably computed. It cannot, because remember: it is “uncomputable”. A great example is the pervasive and false claim that DNA is incompressible. It turns out that it is compressible, but you need a special algorithm (zip and bzip will not do) to compress it efficiently. Ironically, this better compression algorithm is based on evolutionary theory. Not knowing this, one would assume that DNA has much higher functional/semantic information content than it actually has.
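
Here is the sketch promised under point 1: zlib (again only a rough stand-in for an ideal compressor) cannot shrink pure noise at all, while a patterned sequence of the same length collapses, which is why naive compression-based information measures are inflated most by noise.

```python
import os
import zlib

n = 100_000
noise = os.urandom(n)                        # pure random noise
patterned = (b"GATTACA" * (n // 7 + 1))[:n]  # highly repetitive, same length

for label, data in [("noise", noise), ("patterned", patterned)]:
    compressed = zlib.compress(data, 9)
    print(f"{label:9s} {len(data)} bytes -> {len(compressed)} bytes "
          f"({8 * len(compressed) / len(data):.2f} bits/byte)")
# Noise stays near 8 bits/byte (incompressible); the patterned sequence drops to
# a tiny fraction of that. A naive "information" measure built on compressed size
# therefore reads highest on noise.
```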

The fact that they are renaming and rederiving old ideas (like mutual information and conditional entropy) makes all this less obvious. One has to know information theory very well to recognize the formulas. None of the formulas they present are actually new. Relabeling old formulas with new names makes it harder to connect them to all the prior work done in this area. That is helpful to their case, because it becomes less obvious how the formulas are misapplied and where errors arise.

One example of an egregious error is the claim that ASC is a lower bound, and therefore underestimates information content. http://robertmarks.org/REPRINTS/2014_AlgorithmicSpecifiedComplexity.pdf This claim is false, but it is hard to see why, because ASC is a “mash up” formula defined as Shannon entropy (which assumes uniform Markov processes) minus Kolmogorov complexity (which is an upper bound). This sort of mixing of theories can be okay, but by not carefully tracking the underlying assumptions on each side of the minus sign, Marks incorrectly concludes that ASC is a lower bound. Ironically, the paper by Marks I linked to actually produces an example (section 3.2) demonstrating that his claim that ASC is a lower bound is false, and that its claim to identify intelligence is false too.

Anyhow, I hope this is understandable and interesting to people. Curious to hear people’s thoughts and questions.

6 Likes

@Swamidass

Now that was an important education! Excellent one-post primer!!!

This is really fascinating and interesting! I wish I had some good questions for you, but I’m still just admiring it.

I’ve always had a feeling that the information arguments were particularly weak, but just at the level of exclaiming, “Of course mutations can add information to DNA!” and saying how. I could not have tied the argument to the other false arguments about thermodynamics.

I wonder if they did rederive the formulas themselves, or just renamed ones they found?

1 Like

Reading their work and having talked to a few of them, I think that they genuinely believe they have solved fundamental problems in information theory. I think they genuinely believe they have proven evolution impossible. I do not think they are lying. They may be wrong, but they do seem to believe what they are selling.

Often, they take the existing work out there and combine pieces from different frameworks (ASC is a great example of this, combining the Kolmogorov and Shannon frameworks). In principle, this can be okay, but only if the limitations and assumptions are tracked and carefully handled; otherwise you end up with incorrect proofs. And that is the problem: the proofs of key claims end up being wrong, but in a way that takes an expert (or the right counterexample) to detect. So I am sympathetic to them, in that they probably do not see the errors in their own proofs. But these are subtle errors that take real expertise to detect.

In particular, the proof that Kolmogorov complexity is uncomputable is important. This means that all information calculations from data are tentative. For this reason, information cannot be used to falsify a physical mechanism or detect intelligence in any context I know of, because (as I’ve explained) unexplained mechanisms (including intelligence) produce exactly the same signature as noise.

This does not mean information theory is useless. We use it in science to quantify the amount of information a given model explains about the data. The rest of the information is either noise (very common) or additional mechanisms not yet modeled or understood. There is no way to tell from information theory (without perfect knowledge) how much of this extra information (beyond that which we can explain) is noise, a new but unknown mechanism, or the input of an outside intelligence. There is no way to separate these things in an automated or purely mathematical way.

Our only strategy in science is to build additional models that better explain the data, which we can do by better modeling known mechanisms or by modeling newly discovered mechanisms. In the end we are always (in biology) left with additional unexplained information. But we never know how much of this is noise, new mechanisms, or intelligence. And the fundamental problem is that biology is full of noise. That is why science cannot rule out God’s direct action in evolution, but neither can information prove intelligence was necessary.

2 Likes

Joshua

As you say, the various definitions of information (with their associated theoretical backgrounds) make for confusion at any non-technical level, where one cannot rely on operational definitions applying. The fact that semantic information is the least definable type is the most confusing factor of all.

So a “lay” reading of your thread strapline, that information, as Shannon entropy, will almost inevitably increase over time (combined with “lay” befuddlement over the ensuing detailed discussion), is going to be understood by many as implying that information theory says LUCA inevitably leads to man. But as you write:

Another surprising result is that the highest information content entity is noise, exactly the opposite of our intuition of what “information” actually is.

That is a key point. Wikipedia’s article helpfully expands the complementary aspect of this:

English text, treated as a string of characters, has fairly low entropy, i.e., is fairly predictable. Even if we do not know exactly what is going to come next, we can be fairly certain that, for example, ‘e’ will be far more common than ‘z’, that the combination ‘qu’ will be much more common than any other combination with a ‘q’ in it, and that the combination ‘th’ will be more common than ‘z’, ‘q’, or ‘qu’. After the first few letters one can often guess the rest of the word. English text has between 0.6 and 1.3 shannons of entropy for each character of message.

So an English text has less information than a random string. The English text is “organised” according to its semantic content (“meaning”) and so is moderately compressible (as opposed to being “ordered” in a patterned, highly compressible, way), which is why it is the relative lack of (Shannon) information that renders it intelligible.

I take it that this organisation is analogous to what is of most interest in DNA, ie its ability to code for functioning “forms most beautiful”. The organisation of DNA is what makes it useful to the organism, which correlates with a loss of information compared to a random string, or a gain of information compared to a crystalline pattern - but something much less measurable when comparing (say) LUCA to man. It would seem, then, that the interesting function of DNA is largely orthogonal to formal information theory.

Our English text, copied and corrupted over time, will indeed inevitably increase in information content in the Shannon sense, but only as it inversely loses semantic meaning. Likewise, would not a string of DNA reconstituted randomly from a homogenised genome contain the maximal amount of “genetic information”, and be absolutely functionless?

Therefore, whilst it may be right to point to confusion over terms in discussions by the ID people, one should also be critical of those calling a gene duplication or some other neutral mutation an “increase in genetic information” if, by that, all one means is that there is an increase in randomness. It leaves the underlying question of “How come so much organisation?” unaddressed, it seems to me, whilst conveying the impression, through the ambiguity of the word “information”, that biological organisation has been explained.

This, I think, is the key point in the discussion of evolution and intelligent design. If it is inherently impossible to distinguish between noise, unknown mechanisms, and outside intelligence, then there can be no “signature in the cell.”

Here’s how I understand the statement; please correct me if I’m wrong. Shared synteny in a pseudogene is a signature of common descent. At the same time, it also increases the functional sequence complexity of the genomes under study.

Some ID theorists claim that functional sequence complexity is evidence of intelligent design–i.e., they claim it cannot be created by stochastic mechanisms. However, we see that FSC increases in the presence of pseudogenes, so the claim is dubious.

This surprising insight belies the claim that we humans can detect the injection of information by an intelligent designer. What looks like high information content is indistinguishable from random noise.

You’ve given us a lot to chew on, Joshua. Thanks for the fascinating essay!

1 Like

Regarding FSC, that is wrong. Sorry.

FSC computes the mutual information in a sequence alignment of proteins that all have the same function. This measures the commonality between the sequences. It is asserted that all this information is there exclusively because of shared function. Supposedly, high amounts of FSC are the unique product of minds.

Turns out, common ancestry produces a very strong signal of shared information in sequences like this too (it is a source of shared noise). So does any realistic mutational mechanism. So does positive selection. None of these information-generating mechanisms are considered in FSC. So it is a dramatic overestimate of the amount of information in proteins caused by shared function.
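
For the curious, here is a rough sketch of the kind of column-by-column calculation behind FSC-style measures (the toy alignment is made up, and this follows the published “fits” recipe only loosely: observed column entropy subtracted from a uniform ground state). Notice what it actually rewards: any conserved column scores highly, whether the conservation comes from shared function, common ancestry, mutation bias, or shared noise.

```python
from collections import Counter
from math import log2

# Hypothetical toy alignment of "homologous" protein fragments (same length).
alignment = [
    "MKTAYIAK",
    "MKTAYLAK",
    "MKSAYIAK",
    "MKTAYIGK",
]

GROUND_STATE = log2(20)  # entropy per site if all 20 amino acids were equally likely

def column_entropy(column):
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# FSC-style score: summed drop in entropy from the uniform ground state,
# column by column (measured in "fits" in the ID literature).
fits = sum(GROUND_STATE - column_entropy(col) for col in zip(*alignment))
print(f"{fits:.1f} bits of 'functional' information claimed for this alignment")
# Every fully conserved column contributes ~log2(20) = 4.3 bits, regardless of *why*
# it is conserved -- which is the overestimate described above.
```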

Remarkably, they do not even consider the possibility that there are other sources of information (like noise), let alone make a coherent case that they do not apply. The claim that information is the unique product of a mind is not only dubious, it is false.

Exactly. This essentially is a sophisticated, mathematical version of “intelligence of the gaps”. It is nothing more.

Exactly.

To clearly understand this, it is important to recognize that “shared information” (i.e. mutual information, or the information in common) can also be caused by noise: shared noise. That is the most common reason for high mutual information in scientific data.

Shared noise, for the record, is exactly how we infer common descent too! For this reason, a large proportion of the “information” these ID methods compute is actually explained by common descent.

Semantic information is the least definable because it is subjective: in different domains the salient meaning of data is different. There is a way to model this, though. There is actually a whole subfield of informatics that does “knowledge engineering” and “semantic modeling.”

One example you may be familiar with is ICD9 and ICD10 diagnosis codes in medicine. They summarize key semantic information about a patient into a very dense and parsable format. This format is a very, very good “compression” of a patient’s records for some purposes. A few bits of information about each patient enable reimbursement, surveillance, quality metrics, and more.

So one way to think of semantic information is that it is a reduction of the data down to just what is needed for a specific task. Of course, there are many specific tasks, so each task will produce a different amount of semantic information.

Another way to think of semantic information is as a dictionary lookup, where we just store a pointer to a larger concept. This is a little like how language works. Semantically, it is legitimate to compress “all of the cat photos on the internet” down to exactly the few bits of information in that phrase. Likewise, the phrase “all functional proteins” is a semantic compression of that concept too.

This, of course, starts to make it obvious that there actually is not any semantic information germane to DNA. Remember, semantic means “meaning”, but DNA does not convey meaning. Rather, even semantic vocabularies of DNA (like GO annotations) are just descriptions of human knowledge about DNA.

Regardless, because there are a large number of semantic tasks one can do with any type of data, there is no clear way to define semantic information in a global way. What we do know, however, is that in almost all cases (I cannot even think of a counterexample), semantic information is a very, very dramatic and lossy compression of the starting data. The reason why is that the vast majority of the information in any piece of data is irrelevant to meaning.
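
A back-of-the-envelope sketch of how lossy this is (the document size and number of categories are hypothetical): classify a document into one of a handful of topics and you have extracted the task-relevant “semantic” bits while discarding essentially everything else.

```python
from math import log2

document_bits = 8 * 100_000   # a ~100 KB document (hypothetical)
num_topics = 16               # a hypothetical semantic task: assign one of 16 topic labels

semantic_bits = log2(num_topics)  # the task's output carries only log2(16) = 4 bits
print(f"raw data: {document_bits} bits; semantic output for this task: {semantic_bits:.0f} bits")
print(f"that is a ~{document_bits / semantic_bits:,.0f}:1 lossy 'compression'")
```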

This of course is not what I am stating. Information tells us nothing about the accrual of new functions or abilities. This is not a proof that “evolution is easy”. Rather, it is a definitive statement that the information-based arguments against evolution are wrong. The high “information” content of DNA is explained primarily by noise and shared noise, as well as mutational mechanisms, selection, and more.

Science has no way of telling how much of the remaining information is external intelligence. So there is certainly space for believing God’s action is possible here, but there is no justification for concluding this directly from the data.

A lot of this is not really correct. Sorry.

English text is moderately compressible. However, the semantic information in English text is not moderately compressible; it is highly compressible. What is the difference? Inconsequential spelling, vocabulary, and grammar differences. So a good part of the data in text is noise too, but because of the restrictions of English grammar, the compressibility is increased. All these properties (and their relative contributions) are quirks of the English language and do not necessarily transfer to other languages, let alone to totally different domains like DNA.

And text does have very high Shannon information content, which usually assumes an IID generative model, so it cannot take into account the dependency structure of English text. This is actually a very important point. Shannon information for real data will usually be much higher than a smart Kolmogorov (compression) estimate of information. As we will see, that is actually the fact that breaks the proofs for ASC.

The best way to get a sense of all this is to actually write up the code and study data.
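
In that spirit, here is a minimal sketch comparing the two estimates on the same string (the sample text is an arbitrary repeated passage; any English text you have on disk works as well). The IID Shannon figure ignores the dependency structure, so it comes out well above the compression-based estimate:

```python
import zlib
from collections import Counter
from math import log2

text = ("It was the best of times, it was the worst of times, it was the age of "
        "wisdom, it was the age of foolishness, it was the epoch of belief. ") * 50

def iid_entropy(s):
    """IID Shannon estimate in bits per character (letter frequencies only)."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

compression_estimate = 8 * len(zlib.compress(text.encode(), 9)) / len(text)
print(f"IID Shannon estimate:       {iid_entropy(text):.2f} bits/char")
print(f"Compression-based estimate: {compression_estimate:.2f} bits/char")
# The compressor exploits the repetition and grammar that the IID model cannot see,
# so the Kolmogorov-style estimate lands far lower.
```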

Not true. If we implement semantic selection, rejecting text that is no longer correctly understandable, information will increase indefinitely, and semantic information will stay constant.

I disagree with this. And it is totally valid to state that duplications increase information, because they do.

This is actually the heart of the problem. Let me now show you why.

Information is any change to the data, by definition. The key challenge is quantifying this. Let’s consider your case of a duplication. A duplication doubles the Shannon information (SI) of the sequence. But it only increases the Kolmogorov complexity (KC) a tiny amount. And KC in this case is a good measure of information content.

But let’s consider Algorithmic Specified Complexity (ASC), to see what a duplication does to ASC information content. Remember, Marks proposes this as a reliable way of identifying intelligence. Let me step you through this (a small sketch follows the list below).

  1. ASC is defined as SI minus KC.
  2. Remember SI doubles with a duplication
  3. Remember KC stays about the same.
  4. SI and KC are always positive numbers
  5. That means ASC more than doubles with a duplication.
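
Here is the sketch promised above, stepping through those five points with made-up numbers: a uniform 2-bits-per-base model supplies the Shannon term, and zlib stands in (crudely, as an upper bound only) for Kolmogorov complexity.

```python
import random
import zlib

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(5000))  # hypothetical sequence
dup = seq + seq                                            # the same sequence after a duplication

def shannon_bits(s):
    return 2 * len(s)  # SI under a uniform IID model over {A, C, G, T}: 2 bits per base

def kc_estimate_bits(s):
    return 8 * len(zlib.compress(s.encode(), 9))  # crude upper-bound stand-in for KC

for label, s in [("original", seq), ("duplicated", dup)]:
    si, kc = shannon_bits(s), kc_estimate_bits(s)
    print(f"{label:10s} SI = {si:6d}  KC ~ {kc:6d}  ASC = SI - KC ~ {si - kc:6d} bits")
# For the original random sequence ASC sits near zero (slightly negative here, an
# artifact of compressor overhead). After the duplication SI doubles while the KC
# estimate barely moves (the compressor spots the repeat), so ASC balloons: a
# duplication manufactures a large amount of this supposed design-detecting quantity.
```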

And we are to believe that ASC can reliably detect intelligence? And we are expected to concede that duplications do not increase information, when a duplication more than doubles information by an ID metric for information? Remember, duplications do increase both SI and KC too, just by different amounts.

Really, this is an example of why ASC is a horrible way to measure information content for real data. It is actually just measuring compression efficiency, that’s it. It is also an example of how thinking about “information” generically is an unhelpful construct. We need to make our claims more carefully, by specifically explaining the type of information measure we are using (remember, they all fail in some contexts).

Even more remarkably, in the same breath that duplication is claimed to be “no information of consequence”, people will point to “low complexity regions” as evidence of design. What are low complexity regions? Just long stretches of short duplications with small changes here and there. They are highly variable and hard for most bioinformatics algorithms to handle (because they are low information!!!). So we are supposed to accept that duplications do not produce new information (when they do) and that highly repetitive sequences are high in functional information (when they are just low in information, period). It seems like some people either (1) have no idea what they are talking about or (2) are being very opportunistic and inconsistent in their way of ascertaining important information.

This is almost true. It turns out that information theory enables us to reconstruct the history of sequences. And this history is actually very well correlated with interesting function in many cases. There is still noise, but this is our best way of understanding DNA.

I could have added phylogenetics as an extension of information theory too. It is exactly that.

4 Likes

I would add, the same is true with FSC and Complex Specified Information (CSI).

This is really important too, because these measurements on DNA are presented as evidence of intelligence. The fact of the matter is that information measures do not follow our intuition. And when computations using these measures are presented as evidence, unexpected (to non-experts) things can happen.

So yes, in the context of the debate over ID, duplications do increase information, sometimes substantially (depending on the metric).

There is one possible way forward for ID. They could suggest Design Principles (DP) that (1) explain large amounts of currently unexplained information in the data (quantitatively, of course), and (2) cannot be mapped to any plausible physical mechanism. If such DP could be found, that might be interesting. Knowing the data first hand, however, so much information is explained by common descent that it is hard to imagine this is possible.

2 Likes

Please don’t apologize! I welcome the correction, because I want to understand this subject area better. In fact, can you recommend a resource or two that would help me (and anyone else so inclined) to delve deeper into the relationship between information theory and biology?

Thanks!

2 Likes

Thanks, Joshua, for a very clear exposition of a complex subject. Much food for contemplation.

You actually re-emphasize the idea I’ve held for a while that intentional design is formally indistinguishable from Epicurean chance, by demolishing some of the qualifiers I’ve tended to add to that from information theory.

That indistinguishability itself is a deep mystery crying out for an explanation, in that it’s hard to think of two more opposed concepts than chance and choice. It leads me to wonder if the best explanation might be that they are both actually the same thing, ie that one or the other doesn’t exist. That would imply that either no design exists in the universe at any level and all is Epicurean chance, or that no Epicurean chance exists in the universe and all is, ultimately, design.

That would take us back to the fundamental metaphysical divide between naturalism and theism. Either one might be true, but both cannot be.

1 Like

That isn’t very reformed of you! =)

Why just limit design to the unexplained information of the data? I think God designed the whole thing. He providentially governs all things. I know this from Scripture though. Information theory tells nothing about this.

And do remember that it is possible that intelligence might be recognizable in data…

No one yet has managed to successfully do this, though I do commend those who try.

Is it though?

This is just a restatement of the simple truth that God’s action is hidden in most things, and humans cannot understand the mind of God unless He reveals Himself. The “indistinguishability” logically follows from just two things: (1) we do not fully understand the world and (2) we do not fully understand God. If those things are true, it follows that God is often hidden.

2 Likes

On the contrary, Joshua, in my experience it’s Reformed people who have least trouble accommodating that kind of idea without threat, barring a few Catholics and Orthodox of my acquaintance!

1 Like

Well whatever “chance” and “choice” are, I know they are both governed by God’s providence. Thanks for joining in the conversation here. I know this is a technically dense topic.

1 Like

I was going to leave your post unanswered, Joshua, but then I got to thinking about this sentence, and looked into Aquinas to abstract working definitions of “providence” and “divine will” (the nearest thing to divine choice, I suppose). I know that Aquinas’s treatment of these is not far off Reformed understandings. So:

Providence: the ordering by God of all things to their proper, good, ends.

Divine Will: God’s communication of his own goodness to all things to the end that they may participate in that goodness.

That sounds almost the same thing in different words.

On chance, he argues that if events have a cause, that cause is ultimately God’s providential will. If they had no cause in his will, they would be outside God’s providence - but then they couldn’t occur anyway. Seems logical to me!

@Jon_Garvey, that’s a pretty powerful statement from @Swamidass, wouldn’t you agree?

Can I have a witness here? Maybe @grog ?!? It doesn’t get any more God-centered than that!.. not realistically speaking anyway!

1 Like

Skimming through Shannon’s paper, I don’t see the equality.
Are you saying that a random noise generator creates information? I think not. Information always requires a deciphering key and some redundancy, both of which reduce entropy.

I think it’s on page 11 of Shannon’s landmark paper, which Joshua @Swamidass cited in his original post.

Joshua, please correct me if I am wrong. Thanks!

1 Like

@NonlinOrg

If a random number generator can eventually pick any lock with a finite number of numbers… I think it’s pretty clear that information can be generated unintentionally … and probably with amazing consequences.

@Swamidass’s career is in the area of computation and artificial intelligence… I think you are kidding yourself if you think you can refute anything he tells you is the God’s truth.

Don’t see it, but the graph on that page that peaks perfectly at (0.5, 1) does not account for the key and for redundancy. Shannon takes the ideal case there.

By itself and without other inputs? Not happening.