DNA is not like human language or code

Swamidass · March 17, 2017, 6:05pm

Yes, DNA is also LIKE human language and code. But there are many more differences than similarities.

Continuing the discussion from Information = Entropy:

This is similar to the claim that DNA is a language.

I manage a team of scientists writing computer code and have been a programmer for 20 years. I deal with biology all the time. I know how language, computer code, and DNA work. These things seem almost entirely different to me.

Yes, I do see a very, very weak analogy. This analogy is sometimes useful for teaching students some of the details. Only sometimes. However, it is a very limited way of understanding biology. Fixating on these small similarities seems only to confuse our understanding of how biology works.

I am curious if anyone can explain what they think I am missing here. In addition to listing off similarities between DNA and language (which I agree do exist), explain what you know to be the differences and why you think the differences are not important.

Swamidass · March 17, 2017, 7:04pm

And I am curious if anyone can answer this final question too…

Information = Entropy

DNA is certainly not designed to communicate with human scientists. We cannot even process short stretches of it in our brains, and have to rely on computer software at every step even to begin to think about it.

So, here is the real question, the real task we are faced with. I can give you two sequences:

a sequence of DNA that is totally random (and I will not tell you how I generated it)
a sequence of DNA of the same length that encodes a biologically important function.

We can extend this further. I can give you as many pairs of examples as you like (though not infinite =).

Please tell me how you will determine the amount of information in each of these DNA sequences. Do you expect to be able to easily tell the difference between the two? How would you quantify the amount of information in them? Can you quantify the amount of order? Would you be able to determine which one was random vs. functional? What mathematical formula or algorithm would you use?
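One naive answer to the formula question is the empirical Shannon entropy of the base frequencies. The sketch below is only an illustration of why that number does not settle anything: it depends on composition alone, so a sequence and a scrambled copy of it score the same. The `functional` string is a hypothetical placeholder, not a real gene, and `shannon_entropy` is a name chosen here for illustration.

```python
import math
import random
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Empirical order-0 Shannon entropy of seq, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical stand-in for a "functional" sequence; not a real gene.
functional = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"

# Scramble the same bases: the composition is unchanged, the order is destroyed.
rng = random.Random(0)
scrambled = "".join(rng.sample(functional, len(functional)))

print(shannon_entropy(functional))
print(shannon_entropy(scrambled))  # same value, up to floating-point rounding
```

Because the two numbers agree, this formula by itself cannot say which sequence is the functional one. It measures the letter frequencies, not what the sequence does.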

My point is that it is trivial for me to give you sequences where:

Neither you nor experts in the field would be able to tell the difference between the random and functional sequences.
Neither could anyone write a piece of software that could tell the difference.
Neither could anyone design a biological experiment to discriminate the two sequences.
Neither could anyone even compute the true entropy of these sequences. Because I generated the random sequence, I will be able to tell you the true entropy.
Neither can anyone even compute the true information content of these sequences. Because I know which one encodes the biological function and which one is random, you would have to give me different numbers for each sequence.

If you think I am wrong, you can always take me up on the challenge. We can see how far you get.

This is the core of the problem. If you cannot understand the data entirely, you have absolutely no way to confidently answer the important questions about it. Applying a formula to it or qualitatively reasoning about it gets you nowhere of consequence. You might as well just be staring at static on a TV screen. The fact that it looks like static to you tells you absolutely nothing about what it really is.

If this is true, and it is, what exactly is the information theory argument for Intelligent Design?

gbrooks9 · March 18, 2017, 9:55pm

I would suggest that a big difference between DNA and language is that DNA is not arbitrary like language.

A string of DNA can specifically correspond to a protein; changing the DNA can make it impossible to produce the protein.

Conversely, words and the spelling of words are all rather arbitrary. Spelling and syntax can be fiddled with quite a bit … and the words or phrase or sentence could still be essentially understandable.

Swamidass · March 18, 2017, 10:11pm

I would put it differently, saying their are different constraints operating on protein-coding DNA and human language.

The function of a specific DNA sequence is to produce a protein with a specific function. Its constraints are not entirely contextual; they are defined by physics in addition to the biological system in which it exists.

In contrast…

The meaning of words in a language is entirely defined by the man-made conventions and semantics applied to them, which can be changed at will. However, there are some limitations given the nature of human minds; for example, most words will be pronounceable and easy to visually parse, with no more than several syllables.

Also, one way linguistics misleads us about DNA is that language is much, much more fragile than DNA. Both can tolerate change, and neither can tolerate all change. But DNA is much more robust than English text.

gbrooks9 · March 19, 2017, 12:31am

Certainly, DNA can effectively build a protein whether or not anyone understands how to interpret DNA.

Mervin_Bitikofer · March 19, 2017, 12:26pm

I’ll push back a little here. Not that my understanding of DNA is comparable to Joshua’s. But I do have a bit more understanding of English language and can see that we all throw in sentence fragments (errors!) sometimes deliberately. Joshua (and most of us) occassionaly misspell things or use the wrong “their” or “there”. These are all errors of communication and yet they do nut prvnt us frm being able to dissern the meening we intent to communictate. So there are some who might find some confusions in my poor writing above – perhaps it slows you down some as you read it. But if you could understand my intended meanings above, then that is testimony to the robust nature of textual communication. If enough intelligent people like Joshua keep interchanging “there” and “their”, then our textual conventions may change in a generation or so and those “mutations” will become the new convention.

So it seems to me that language might have much more tolerance than computer code (I know – you seem to be trying to wean us of that analogy, Joshua, but bear with me here.) A one bit change in machine language is likely to cause a crash. A one-character misspelling in a paragraph barely even slows us down. I hear you arguing that DNA is also very error-tolerant, which I have no reason to doubt. And that would leave computer code as the odd one out of these three – by far the least error-tolerant which highlights a difference. But isn’t that just a difference of information density? Computer code is very information dense. If it is written efficiently there is very little in it that is nonessential or unimportant. DNA apparently may have vast swaths of redundancy and concentrated areas that code for crucial proteins. Would it be true to say that a perpetuated single codon error in the middle of a crucial protein (I barely know what I’m talking about here) could easily prove fatal or at least severely detrimental to the organism? Would computer code be a bit more like DNA if the vast proportions of the code were in mere programmer remarks or other “less essential” functions? Then errors of transmission in that code are more likely to be found among its less crucial bits.
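The brittleness of code under single-bit changes can be checked directly. The sketch below is a rough illustration only: it uses a two-line Python function rather than machine language, and the names `SOURCE`, `still_works`, and `single_bit_mutants` are made up for this example. It flips every bit of the source, one at a time, and counts how many mutants still define an `add` that returns the right answer.

```python
SOURCE = b"def add(a, b):\n    return a + b\n"


def still_works(code: bytes) -> bool:
    """Return True if code still defines an add() for which add(2, 3) == 5."""
    namespace = {}
    try:
        exec(compile(code, "<mutant>", "exec"), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        # Syntax errors, missing names, and bad operations all count as a crash.
        return False


def single_bit_mutants(code: bytes):
    """Yield every copy of code that differs from it by exactly one bit."""
    for i in range(len(code)):
        for bit in range(8):
            mutant = bytearray(code)
            mutant[i] ^= 1 << bit
            yield bytes(mutant)


mutants = list(single_bit_mutants(SOURCE))
survivors = sum(still_works(m) for m in mutants)
print(f"{survivors} of {len(mutants)} single-bit mutants still work")
```

A tiny function like this has almost no redundancy, so it is the "information dense" extreme described above. Padding the source with comments would raise the survival rate, since flips landing in a comment are usually harmless.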

The other major difference (between textual communication and DNA) is that we have agreed upon conventions (that can change!) for our text. We make dictionaries to try to slow down that change and promote consistency for the sake of clarity, at least within the scale of a human lifetime. DNA, so far as we know, does not involve any will to communicate.

Similar to your DNA challenge, I could also pick out some string of “random digits” from somewhere in pi and defy anyone to distinguish it from a different (but same-sized) string of randomly generated digits. While the pi digits would not be truly random, the situation is blurred in that I could generate my “random string” first, and then proceed to find that exact same string somewhere in pi due to its infinite search space. I realize that while the digits themselves are then considered less random, it causes the location I was forced to find in it to become the random aspect. Still, all this shows the muddled state of the waters for those who want a fail-safe algorithm to distinguish true noise from true information. I do continue to think that such a task is impossible.

[A first self-corrective edit already added to my ‘DNA’ above!]

glipsnort · March 19, 2017, 3:07pm

I agree that natural language is actually very error-tolerant, and that computer code is the outlier. I think it’s more than just information density, though. Compilers can often guess what mistake I made in coding, which means there’s sufficient redundancy for my error to be tolerated. They’re simply not designed for that kind of tolerance. In fact, they’re often designed to be highly intolerant of coding errors. That’s because a program that sort of does more or less what the writer expected is not the goal; the goal is a system that precisely and predictably carries out instructions.

It has long struck me that, to the extent that DNA does resemble language or computer code, it most resembles those features that are not intelligently designed. Vestigial functions, multiple ways of implementing the same operation, needless complexity, branching evolution and development of new functions for originally identical systems – the kind of features that appear in natural languages and in badly designed computer code.

Swamidass · March 19, 2017, 3:53pm

Yes, language is tolerant of error.

And yes, computer code is much more brittle.

But the real question is where DNA is compared to human language and human code. I will assert that it is much less brittle than human language still. So my original point still stands. DNA is much more error tolerant than human text, and much less regular.

This is one reason that gzip compression works much better on human text (even with misspellings) than it does on DNA. There is a much tighter dependency structure in text than in DNA.
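The compression claim is easy to try. The sketch below is a minimal, hedged setup and not a benchmark: it uses Python's `zlib` (DEFLATE, the algorithm inside gzip), and it assumes uniformly random bases as a stand-in for DNA with no dependency structure, which real genomes only approximate. `compressed_bits_per_symbol` is a name chosen here for illustration.

```python
import random
import zlib


def compressed_bits_per_symbol(data: bytes) -> float:
    """Bits per input byte after DEFLATE (the algorithm behind gzip)."""
    return 8 * len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)
# Assumption: uniformly random bases, i.e. no dependency structure at all.
random_dna = "".join(rng.choice("ACGT") for _ in range(20_000)).encode("ascii")
# A perfectly regular sequence, for contrast.
repetitive_dna = ("ACGT" * 5_000).encode("ascii")

print(compressed_bits_per_symbol(random_dna))      # cannot drop meaningfully below 2
print(compressed_bits_per_symbol(repetitive_dna))  # close to 0
```

Four equally likely letters need 2 bits each, so a result near 2 bits per base means the compressor found no structure to exploit. To test the claim itself, run the same function on a long English text and on a real genome (sequence lines only) and compare each result with the log2 of its alphabet size.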

Also, there is a difference between language and DNA. Language encodes semantic information (which ultimately makes it much more compressible), but DNA and computer code encode “functions”. DNA, also, is not intended to convey meaning. So DNA is more like code (if we are to pick one of the two). So the fact that human code is so brittle, and DNA is not, should be a very strong clue that these are very different things.

Your sentiment is correct, but your use of “Information” is not. DNA is much less dense in information important for function than human code. But because much of the information outside the functional area is noise, it is actually very information dense. That is why it is so hard for gzip to compress it.

And even in the functional area of DNA this is not true. Even in these places, there is very high tolerance to error. You can go look for yourself at the amount of variation we see in coding regions of the genome that has no apparent effect (http://exac.broadinstitute.org/).

Notice that most of these “codon errors” are not comparable to a misspelling in human text. Each is more like an “alternative” spelling, because it has no discernible effect on function. In DNA the number of alternative spellings is astronomically large, whereas in human language there are far fewer.
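One source of these alternative spellings is the redundancy of the genetic code, where several codons encode the same amino acid. The sketch below uses the standard genetic code to count how many DNA sequences spell the same short peptide. It covers synonymous codons only, so it understates the tolerance described above, which also includes amino acid changes with no apparent effect. The function names and the peptide are illustrative choices.

```python
from collections import Counter
from itertools import product
from math import prod

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Number of codons that encode each amino acid ('*' marks a stop codon).
DEGENERACY = Counter(CODON_TABLE.values())


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, codon by codon, into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def synonymous_spellings(protein: str) -> int:
    """Count the distinct DNA sequences that translate to the given protein."""
    return prod(DEGENERACY[aa] for aa in protein)


print(translate("ATGCTGAGCCGT"))     # MLSR
print(translate("ATGTTATCACGA"))     # MLSR, with a different DNA spelling
print(synonymous_spellings("MLSR"))  # 216
```

The count multiplies with every added residue, so even a modest protein of a few hundred amino acids has far more synonymous DNA spellings than any English sentence has acceptable spellings.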

beaglelady · March 19, 2017, 4:38pm

A pox on the dreaded “=” vs “==” in PHP, C++, etc.!
