"Devolution" and gene loss in evolution

agauger · October 20, 2017, 4:39pm

I would think that evolutionary biologists would be looking for functions that increase fitness, i.e. the ability to produce more offspring in a given environment. For bacteria, that is a straightforward test.

Are you saying that a GC content that differs from the surrounding sequence indicates nonrandomness? What about homologies to mobile elements, etc.? Do they indicate nonrandomness or randomness? And assuming this parental stretch of chimp DNA that becomes the human orphan gene has all these features before expression begins, and simultaneously has no stops, what are the odds of such a nonrandom sequence happening by chance?

I’d rather not rely on sequence comparisons. First, I am an experimentalist. I’d rather see the process actually work. Second, sequence analysis relies on the assumption that any similarities or differences are due to the process you are attempting to establish. If you start with the question, is promoter capture possible, and then compare sequences with new promoters and the sequences from which they came, you have not established how it happened, merely that there is now a promoter where there was none.

That’s a story. I’d like actual experimental evidence. So many papers rely on sequence comparisons to claim a history of what happened. I’d like to know how hard or easy those steps actually are.

agauger · October 20, 2017, 4:45pm

There is a strong bias against Doug’s conclusions, which is enough to account for his lack of citation. I have not seen his work refuted. Arthur Hunt did not understand the paper. Nonetheless I plan to discuss the other options in a piece I plan to write when other work allows. This includes Neem et al. I think my cost of expression paper will be directly relevant.

sfmatheson · October 20, 2017, 5:00pm

That is inappropriate and, more importantly, false. It substantially reduces your credibility.

It doesn’t need to be refuted, because it is unimportant and irrelevant. I did not claim that it is wrong. It’s just uninformative. And the emphasis on folds is one big reason why. The body of scholarship on the topic of the protein universe has moved far beyond both the emphasis on folds, which is now known to be incorrect, and the narrowness of the 2004 paper (a single enzyme, assessed by looking at a single function). To suggest that the 2004 paper is singularly important is to ignore essentially all work on the very interesting topic of protein evolution and the nature of the protein universe.

Much worse than the over-emphasis on Axe’s old work is the failure to discuss or even mention the state of the art in the field, a paper published just a few months ago that actually does refute many of the main claims that you seem to prefer. This is where that field stands now. A credible discussion of the field could not possibly overlook it:

http://science.sciencemag.org/content/357/6347/168.long

T_aquaticus · October 20, 2017, 5:07pm

I was thinking more along the lines that 0.50 GC content may not reflect the actual nucleotide content in regions where de novo genes are emerging.

“Comparison of GC content in the region surrounding the TSSs clearly revealed that de novo genes are more A/T-rich than conserved annotated genes (S10 Fig).”

“LTR frequency is higher in the region -100 to +100 in de novo genes when compared to conserved genes (Fisher-test p-value < 10^−18).”
Ruiz-Orera et al. (2015)

Using a random DNA generator with 50% GC content may not accurately model what is going on in these genomes.

[quote=“agauger, post:41, topic:36902”]
I’d rather not rely on sequence comparisons. First, I am an experimentalist. I’d rather see the process actually work. Second, sequence analysis relies on the assumption that any similarities or differences are due to the process you are attempting to establish. If you start with the question, is promoter capture possible, and then compare sequences with new promoters and the sequences from which they came, you have not established how it happened, merely that there is now a promoter where there was none.
[/quote]

How would you rule out random mutation experimentally in these scenarios?

agauger · October 20, 2017, 5:28pm

AG There is a strong bias against Doug’s conclusions, which is enough to account for his lack of citation.
SM That is inappropriate and, more importantly, false. It substantially reduces your credibility.

AG In what world do you live, that you think this is false, or that you think his work is irrelevant? (I’ll ignore the rest.)

SM I did not claim that it is wrong. It’s just uninformative. And the emphasis on folds is one big reason why. …To suggest that the 2004 paper is singularly important is to ignore essentially all work on the very interesting topic of protein evolution and the nature of the protein universe.

AG It’s not irrelevant or uninformative to everybody. And it is most directly relevant to what you asked: “A more valid question is this: what is the probability that a particular random polypeptide will have a function (in a given context)? The one thing we know about this probability is that it is greater than zero.”

On protein origins:
Despite proteins’ profound impacts on life, their
origin is not well understood. What caused a string
of amino acids to start doing something? Or are
strings of amino acids inherently programmed to do
things? These are questions with which researchers
in the protein-origin field have been grappling.
Researchers have a better grasp of the processes
of selection and evolution once a function
appears in a peptide. “Once you have identified an
enzyme that has some weak, promiscuous activity
for your target reaction, it’s fairly clear that, if
you have mutations at random, you can select and
improve this activity by several orders of magnitude,”
says Dan Tawfik at the Weizmann Institute
in Israel. “What we lack is a hypothesis for the
earlier stages, where you don’t have this spectrum
of enzymatic activities, active sites and folds from
which selection can identify starting points. Evolution
has this catch-22: Nothing evolves unless it already
exists.”…
Lupas thinks that function had to precede
structure, because producing a complex structure
is an incredibly hard job. “After 3.5 billion years
of evolution, nature still has a substantial folding
problem,” he states. He points out that, under normal
circumstances, about one-third of a modern cell’s
resources is devoted to protein quality control and
turnover. “We’re not talking about a few proteases
here and there. We’re talking about substantial
resources of the cell just for this routine maintenance,”
says Lupas. “You wouldn’t have to commit
this amount of resources if protein folding was not
problematic.” While Szostak agrees the hypothesis
is elegant, he says there isn’t much experimental
evidence to bear it out.
Szostak says that the origin of protein function
also brings up the question of how many amino
acids were around for making the first proteins.
“There is pretty good evidence that at least some of
the standard 20 amino acids came in late” in evolution,
says Szostak. “Some of the simple, easy-to-make
ones, like glycine and aspartate, were probably
there right from the beginning.” The reduced number
of amino acids plays into the folding issue, because
there may be constraints in folding peptides made
from a smaller number of amino acids.
Overall, what the field of protein evolution needs
are some plausible, solid hypotheses to explain how
random sequences of amino acids turned into the
sophisticated entities that we recognize today as
proteins. Until that happens, the phenomenon of the
rise of proteins will remain, as Tawfik says, “something
like close to a miracle.”

agauger · October 20, 2017, 5:38pm

@T_aquaticus > How would you rule out random mutation experimentally in these scenarios?

My first impulse would be to run a simulation. Specify GC content, random mutation parameters, and see what happens.

sfmatheson · October 20, 2017, 5:52pm

I am a professional scientific editor in regular and deep contact with biologists of all kinds and most especially with biologists studying evolution. In that world, which we may also refer to as the scientific community, the paper is being cited according to its impact and importance. In that world, public accusations like yours (you have made several inflammatory accusations against journals and scientists, which I have found surprising and disappointing) are taken seriously at the outset, because bias (of any sort) is something we all have to look for and fight or at least account for. If you had merely speculated that perhaps the Axe paper was poorly cited because people didn’t believe it, that would be one thing. Instead, I thought you were making an accusation. If in this case you simply mean that people disagree, then I must apologize. It sure looked to me like an inflammatory remark.

[quote=“agauger, post:45, topic:36902”]
It’s not irrelevant or uninformative to everybody.
[/quote]

It is irrelevant in light of two things you have deleted that were central to my comment: 1) extensive knowledge on protein folds acquired since the simple experiments Axe reported in 2004; and 2) the fact that folds are not necessary for function.

Until those two facts are acknowledged, your comments on Axe’s paper will be inaccurate.

These quotes from a news piece are not about the protein universe. They are about protein origins. This is not the topic.

Do you think it is important to acknowledge that folds are not the same as function?

agauger · October 20, 2017, 5:58pm

@T_aquaticus > How would you rule out random mutation experimentally in these scenarios?

My first impulse would be to run a simulation. Specify GC content, random mutation parameters, and see what happens.

My husband and I did a crude simulation of the rarity of ORFs in random sequences of differing GC content. Specifically, we asked how many sequences 900 bp or longer had overlapping ORFs. We were asking it in the context of nylonase, which has 3 (!) non-stop frames. GC-rich sequences do much better at having ORFs at all:

According to the nylonase story, as told by Ohno and Venema and numerous others, a new ATG start codon was formed by the insertion of a T between an A and G, thus creating a new start codon after the original ATG, which shifted the reading frame for that sequence to that specified by the new ATG, creating a completely different coding sequence and thus a new protein. Let us grant that scenario for the sake of argument. Normally such a shift would produce a new coding sequence that would be interrupted by stop codons, so the newly frameshifted protein would be truncated. Thus the only reason this frameshift hypothesis for nylonase is even remotely possible is because the sequence coding for nylonase is most unusual, and contains not one, not two, but three open frames. Although frameshift mutations are ordinarily considered to be quite disruptive, at least in this case the putative brand new protein sequence would not terminate early due to stop codons.

My point? The first step to getting a new functional protein of any length from a frameshift is to avoid stop codons. The odds of a random coding sequence having an open alternate frame, without stops, are poor. As a consequence, if a protein does have an open frame in addition to its coding sequence, it’s worth paying attention to. And it so happens that nylonase does have more than one open frame. The DNA sequence above illustrates the six frames, numbering them frames 1 through 6. Using that convention, frames 1 and 3 are read from the sense strand. Both have no stop codons over the length of the gene in the sense direction. Frame 4 on the antisense direction has no stop codons either. Frame 1 is the coding frame that specifies the nylonase protein, otherwise known as the open reading frame (ORF). It is defined by the presence of both a start and stop codon. The other two frames have no start codons or stop codons, so I’ll call them non-stop frames (NSFs). They are frames 3 and 4.
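To make the frame convention concrete, here is a minimal Python sketch of a six-frame scan for stop codons. The function name `stop_free_frames` and the numbering of frames 4 to 6 (offsets 0 to 2 on the reverse complement) are assumptions for illustration; the convention in the original figure may differ.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the antisense strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_frames(seq):
    """Return the frame numbers (1-6) that contain no in-frame stop codon.

    Frames 1-3 are offsets 0-2 on the sense strand; frames 4-6 are
    offsets 0-2 on the reverse complement (an assumed convention).
    """
    seq = seq.upper()
    open_frames = []
    for strand_index, strand in enumerate((seq, reverse_complement(seq))):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            if not any(codon in STOPS for codon in codons):
                open_frames.append(strand_index * 3 + offset + 1)
    return open_frames
```

Applied to a gene sequence, a result that includes frame 1 plus two other frames would correspond to the "ORF plus 2 NSFs" situation described above.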

The probability of a DNA sequence with an ORF on the sense strand and 2 NSFs is very small. Just exactly how small are the chances of avoiding a stop codon in three out of six frames? We set out to determine that by performing a numerical simulation using pseudorandom numbers to generate sequences at various levels of GC content. (By we I mean that my husband, Patrick Achey, who is an actuary, did the programming work, while I determined the parameters.) We chose to vary the GC content because sequences with a higher GC content have fewer stop codons. Remember, a stop codon always has an A and a T (TAA, TAG, and TGA are the stop codons) so having a sequence with a lower percentage of AT content will reduce the frequency of stop codons. Conversely, higher GC content makes the chances of avoiding stop codons and getting longer ORFs much greater, thus also increasing the chances of NSFs. The genomes of bacteria vary in their GC content, from less than 20 percent to as much as 75 percent, though the reason why is not known. One species of Flavobacterium has a genome with about 32 percent GC and 2400 genes; the precise values vary with the strain. The plasmid on which nylB resides is very different. It has 65 percent GC content. The gene encoding nylonase has an even higher 70 percent GC content, which is near the observed bacterial maximum of 75 percent.
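A simulation of this general kind can be sketched in a few lines of Python. This is not the program used for the figures reported in this post; it is a minimal illustration under stated assumptions: bases are drawn independently, A and T are equally likely, G and C are equally likely, an "ORF" means only that frame 1 has no stop codon across the full length (no start codon is required), and an NSF is any of the other five frames with no stop codon. All names are hypothetical.

```python
import random
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_sequence(length, gc, rng):
    """Draw bases independently; G and C each have probability gc / 2."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def has_stop(strand, offset):
    """True if the frame starting at `offset` contains a stop codon."""
    return any(strand[i:i + 3] in STOPS
               for i in range(offset, len(strand) - 2, 3))


def simulate(gc, trials, length=900, seed=0):
    """Count, among sequences whose frame 1 is stop-free, how many
    of the other five frames are also stop-free (assumed NSF definition)."""
    rng = random.Random(seed)
    nsf_counts = Counter()
    for _ in range(trials):
        seq = random_sequence(length, gc, rng)
        if has_stop(seq, 0):
            continue  # frame 1 is interrupted, so this trial is not an ORF
        antisense = seq.translate(COMPLEMENT)[::-1]
        other_frames = [(seq, 1), (seq, 2),
                        (antisense, 0), (antisense, 1), (antisense, 2)]
        nsf = sum(not has_stop(strand, off) for strand, off in other_frames)
        nsf_counts[nsf] += 1
    return nsf_counts


if __name__ == "__main__":
    counts = simulate(gc=0.70, trials=5_000)
    print("ORFs found:", sum(counts.values()))
    print("ORFs by number of NSFs:", dict(counts))
```

Pure Python is slow for this, so the usage example runs only 5,000 trials; runs of a million or more would take considerably longer and might be better served by a vectorized approach.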

We chose to use a target ORF size of 900 nucleotides (or 300 amino acids) because it is an average size for a functional protein. Nylonase is 392 amino acids long; the small domain of beta lactamase, the enzyme my colleague Doug Axe studied, is about 150 amino acids long. The median length for an E. coli protein is 278 amino acids; for humans, the median length is 375.

As expected, the simulation showed that the higher the GC content, the greater the likelihood that ORFs that are 900+ nucleotides long exist. At 50 percent GC, the average ORF length we obtained was about 60 nucleotides; most ORFs terminate well before 900 nucleotides. Indeed, in our simulation only two out of a million random sequences made it to 900 nucleotides before encountering a stop codon. As a result, we could not determine the rarity of NSFs at 50 percent GC — we would probably have to run the simulation for more than a billion trials to get any significant number of NSFs at all.

Sequences at 60 percent GC gave 57 ORFs at least 900 nucleotides long out of a million trials, while sequences at 65 percent GC produced 404 out of a million, one of which also had an NSF.

NSFs were much more probable for sequences that were 70 percent GC, like nylB. In our simulation 3,021 out of a million trials were ORFs at least 900 nucleotides long. That’s a frequency of 0.3 percent. Of those 3,021 ORFs, 86 had 1 NSF, and none had 2 NSFs. We had to run 10 million trials at 70 percent GC to see any ORFs with 2 NSFs. From those 10 million randomly generated sequences, we obtained 28,603 ORFs; 903 had 1 NSF and only 9 had 2 NSFs.
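The single-frame counts can be sanity-checked analytically. The sketch below assumes independent bases with P(A) = P(T) and P(G) = P(C), and treats a 900-nucleotide ORF as 300 consecutive non-stop codons with no start-codon requirement; these are simplifying assumptions, and the simulation's exact ORF definition may differ.

```python
def stop_probability(gc):
    """Chance that a random codon is TAA, TAG, or TGA at a given GC fraction."""
    at = (1 - gc) / 2  # P(A) = P(T)
    g = gc / 2         # P(G) = P(C)
    return at**3 + 2 * at**2 * g  # TAA + (TAG and TGA)


def open_frame_probability(gc, codons=300):
    """Chance that `codons` consecutive random codons contain no stop."""
    return (1 - stop_probability(gc)) ** codons


def mean_run_nt(gc):
    """Expected distance in nucleotides to the first stop codon."""
    return 3 / stop_probability(gc)


for gc in (0.50, 0.60, 0.65, 0.70):
    per_million = open_frame_probability(gc) * 1e6
    print(f"GC {gc:.0%}: about {per_million:.1f} stop-free frames per million, "
          f"mean run about {mean_run_nt(gc):.0f} nt")
```

Under these assumptions the expected run before a stop at 50 percent GC is 64 nucleotides, and the expected stop-free frequencies are roughly 58 per million at 60 percent GC and roughly 3,000 per million at 70 percent GC, which is in line with the simulated counts of 57 and 3,021. The multi-frame (NSF) probabilities are harder to compute this way because overlapping frames are not independent.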

Interestingly, at 80 percent GC we got a few sequences with 4 NSFs; but I don’t know of any bacterium with a GC content that high.

Our simulation shows that multiple NSFs are very rare. The probability that an ORF 900 nucleotides long with 70 percent GC content will have two NSFs is 9 out of 28,603, or 0.0003. If these figures are recast to include the total number of trials required to get an ORF of that length and GC content and with 2 NSFs, the probability would be 9 out of 10,000,000 trials.
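The two ways of stating the probability differ only in the denominator. A short worked check in Python, using the counts reported for the 10-million-trial run (variable names are illustrative):

```python
trials = 10_000_000  # random sequences generated at 70 percent GC
orfs = 28_603        # sequences with a stop-free frame 1 of 900+ nt
one_nsf = 903        # ORFs with exactly one additional stop-free frame
two_nsf = 9          # ORFs with two additional stop-free frames

p_two_given_orf = two_nsf / orfs   # conditional on already having an ORF
p_two_overall = two_nsf / trials   # per random sequence generated
p_one_given_orf = one_nsf / orfs

print(f"P(2 NSFs | ORF) = {p_two_given_orf:.6f}")
print(f"P(ORF with 2 NSFs) = {p_two_overall:.1e}")
print(f"P(1 NSF | ORF) = {p_one_given_orf:.4f}")
```

With only 9 events observed, both estimates carry substantial sampling uncertainty.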

A sequence like nylB is very rare. In fact, I suspect that for all cases where overlapping genes exist, in other words where alternate frames from the same sequence have the potential to code for different proteins, unusual sequence will necessarily be found. Likely it will be high in GC content. Could such rare sequences be accidental? I think that if we compare the expected number of alternate or overlapping NSFs per ORF, with the actual number we will find that there are more of these alternate open reading frames than would be predicted by chance.

From another study of overlapping genes:

Thus, bacterial genomes contain a larger number of long shadow ORFs [ORFs on alternate frames] than expected based on statistical analysis. Random mutational drift would have eliminated the signal long ago, if no selection pressures were stabilizing shadow ORFs. Deviations between the statistical model and bacterial genomes directly call for a functional explanation, since selection is the only force known to stabilize the depletion of stop codons. Most shadow genes have escaped discovery, as they are dismissed as false positives in most genome annotation programs. This is in sharp contrast to many embedded overlapping genes that have been discovered in bacteriophages. Since phages reside in a long term evolutionary equilibrium with the bacterial host genome, we suggest that overlooked shadow genes also exist in bacterial genomes.
Indeed, a study of the pOAD2 plasmid from which nylB came indicates that there are potentially many overlapping genes on that plasmid. nylB′, for example, a homologous gene on the same plasmid that differs by 47 amino acids from nylB, also has 2 NSFs. These unusual and unexpected features of DNA have consequences for how we think about the origin of information in DNA sequences, as I shall discuss in the next post.

agauger · October 20, 2017, 6:08pm

I am not aware of having done this. If I have done so it was not my intent.

I am aware that many proteins are unstructured and still functional, or only adopt a fold in the right context, say upon binding its ligand. I am aware that much knowledge has been gained since 2004. Does any of it refute Doug’s work? He was specifically writing about the prevalence of sequence adopting a stable functional fold, not the entire protein universe (though at the time people thought folding was necessary for function). It still is for the majority of proteins, I believe.

sfmatheson · October 20, 2017, 6:25pm

“Refute” is your word. “Irrelevant” is mine.

[quote=“agauger, post:49, topic:36902”]
He was specifically writing about the prevalence of sequence adopting a stable functional fold, not the entire protein universe (though at the time people thought folding was necessary for function).
[/quote]

If he wasn’t writing about the protein universe (and we both know he wasn’t), then he wasn’t writing about the probability of protein function within that protein universe (which is the subject of the conversation). The paper is, then, clearly irrelevant to the subject. To state this is not to refute, or even call into question, any of the data in the paper, nor is it to suggest any intellectual or moral dysfunction on the part of anyone anywhere. But to cite this one paper in a vast corpus of rapidly expanding knowledge, as evidence of anything much at all but much less as evidence of any global fact about the protein universe… well, that’s really hard to understand, Ann.

The plain fact is that the paper tells us next to nothing about proteins in general, and barely more than that about beta-lactamase. In 2004, it was potentially interesting; today it is utterly inconsequential. Those are the facts, revealed by lack of citation, revealed by persistent reluctance to even acknowledge today’s state of the art about protein folds, revealed by reflection on the gratuitous speculation about the protein universe based on observations of mutants of a single protein in 2004.

gbrooks9 · October 20, 2017, 6:35pm

@agauger

I don’t see how you can “enhance” a process by means of DNA/RNA without changing the DNA or RNA. Any change is new information, is it not?

This is one of the reasons I reject this strange kind of math where we decide whether something is “evolving” or “devolving”, when it could be argued that a population is doing both at the same time!

Jay313 · October 20, 2017, 7:32pm

Not to derail the discussion or anything, but the last line of the abstract said, “Our approach achieves the long-standing goal of a tight feedback cycle between computation and experiment and has the potential to transform computational protein design into a data-driven science.” I understand the first part of the sentence, but can you (or someone like you) explain the last part to a layman? I’m not sure whether the author is saying that their approach will lead to breakthroughs in designing new proteins (for commercial or medical use?), or that their approach will allow computational protein design to become a “real” science with “real” data. It probably means something else entirely, but it sounds like a true breakthrough, so I’m curious.

sfmatheson · October 20, 2017, 7:42pm

I think it means both. The magnitude of the breakthrough will be easier to judge in a few years, but I do think it is substantial. You may find this accompanying commentary helpful:
http://science.sciencemag.org/content/357/6347/133

The summary of that piece explains the potential breakthroughs and outcomes:

How does the amino acid sequence of a protein chain determine and maintain its three-dimensional folded state? Answering this question—a key aspect of the protein-folding problem—would help to explain how multiple noncovalent interactions conspire to assemble and stabilize complicated biomolecular structures; to predict protein structure and function from sequence for proteins that cannot be characterized experimentally; and to design new protein structures that do not exist in nature. On page 168 of this issue, Rocklin et al. use parallel protein design on a massive scale to create thousands of miniprotein variants and to determine what sequences specify and stabilize these structures. The work opens up considerable possibilities for protein folding and design.

But there seem to be a lot of remaining tasks that might mean the breakthrough is still around the corner:

In their study, Rocklin et al. have taken high-throughput, data-driven protein design, selection, and optimization to new heights, bringing us closer to solving aspects of the protein-folding problem. A combination of high-throughput studies of the sequence-structure-stability relationships described by Rocklin et al. and drilled-down, fully quantitative examinations of the noncovalent interactions within (mini)proteins will bring us even closer to solving this long-standing problem. In turn, this will facilitate better engineering of natural and de novo proteins.

The lay summary of that paragraph is something like this: the paper is getting us a lot closer to making protein folding a data-driven, mineable field, and that kind of breakthrough would make protein engineering a lot easier and potentially a lot more productive. Does that help a little?

Jay313 · October 20, 2017, 7:45pm

Yes, very much. A helpful explanation of the importance of the topic. Thanks. And good to know my reading skills haven’t totally eroded … like the rest of me!

RHernandez · October 21, 2017, 5:06am

Look at page 15 of this

Besides, at the time they were doing this, sequencing was already cheap. The paper says they were sending their sequencing out.

I do not see how you can interpret my pointing to your false sentence that omits neutral evolution as me denying selection. Please explain.

I think it is a problem because you do this in other writings.

agauger · October 21, 2017, 6:26am

@sfmatheson,
It is clear you don’t like the use to which Doug’s paper has been put. Let me take things one at a time.

Doug was writing about stably folded proteins with an enzymatic function. He acknowledges that the presence of what we now call intrinsically disordered proteins might account for the different estimates of functional folds derived by different methods, those he termed the forward and reverse approaches.

One of
two approaches is typically used in these studies.
The first, which could be termed the forward
approach, involves producing a large collection of
sequences with no specified resemblance to known
functional sequences and searching either for
function or for properties generally associated
with functional proteins. If the relevant sort of
properties can be found among more or less
random sequences, this provides a direct demonstration
of their prevalence. The second approach
works in reverse from an existing functional
sequence. Here, the question is how much randomization
a sequence known to have the relevant
sort of function can withstand without losing that
function.

He was aware of the existence of both proteins with tertiary structure and intrinsically disordered proteins, though he did not call them such.

Because forward-approach studies showing function
to be much more prevalent than indicated here
do not report tertiary structure [3–5], the possibility
that the reported functions might not require such
structure must be considered. The fact that peptides
too small to fold may bind ligands [31], and even show
some catalytic activity [17], shows that these functions
do not necessarily imply folded structure. Similarly,
larger proteins may avoid proteolysis in vivo [32, 33],
exhibit cooperative thermal denaturation [34], and
even possess catalytic activity [32] without having
native-like tertiary structure.

The studies I have seen that use the forward approach generally have difficulty demonstrating that their screens produce functional folds. A protein originally derived from an ATP binding screen required considerable massaging before it was soluble on its own and could be crystallized (Lo Surdo, Walsh & Sollazzo, “A novel ADP- and zinc-binding fold from function-directed in vitro evolution,” Nature Structural & Molecular Biology, 2004). What the results of Neem et al. will be remains to be seen.

Studies that use random sequence libraries (the forward approach) typically come up with a higher estimate of the frequency of functional sequences. As Doug said, this may be because the functions being searched for do not require such structure, and so the proteins may be unfolded, obviously a much easier state to achieve. What fraction of extant proteins are intrinsically disordered? One paper says 2% in archaea, 4% in eubacteria, and 33% in eukaryotes (Ward et al. 2004, JMB). What proportion of the protein universe as a whole does that make intrinsically disordered? I don’t know, and I don’t think anyone else does either. But my guess is that the stably folded proteins are the vast majority of the protein structural universe.

Studies that estimate the rarity of functional tertiary folds in reverse are the ones that start from an existing protein with a tertiary structure. They typically come up with lower (rarer) frequencies. Doug cites work from 2 other studies besides his own that indicate a rarity of functional fold of 1 in 10E40 or less. Stably folded proteins may actually be a large majority of all proteins.

Your first statement is an exaggeration. That stable structural folds are rare, very rare is not nothing. And it’s not based only on a single protein.

As for the current state of protein studies, I am not reluctant to acknowledge them. The various kinds of computational models, folding studies, and the studies of intrinsically disordered proteins are interesting. I watch with interest the studies that look for functional proteins from random sequences. It’s a very important question. Ancestral reconstruction and recruitment to new function are also interesting. Maybe I should stop answering questions here so I can write about them.

You can call any study you don’t like, or the act of quoting it, gratuitous speculation. That is neither evidence nor argument. The same protein literature you urge on me acknowledges the problem. Otherwise they would not still be asking the questions.

agauger · October 21, 2017, 6:32am

No. Neutral drift is not new information. Personally I equate new information with new function. Where to draw the line with recruitment I don’t know. Hence my statement, “Then again it might not be an increase in information if all that happened is the enhancement of a minor activity already present.”

You keep harping on evolving and devolving. I don’t know why.

agauger · October 21, 2017, 6:36am

What I said:

[quote=“RHernandez, post:55, topic:36902”]
But correct me if I am wrong–the fixation of most mutations is due to drift, but there is still room for selection to act on favorable alleles. It’s not all drift.
[/quote]

I never said you denied selection. I asked you to correct me if I am wrong.

agauger · October 21, 2017, 7:28am

There were three studies, one with chorismate mutase (10E-40), lac repressor (10E-63) and beta lactamase (10E-74).

You are right. Independent Origins of Subgroup B1+B2 and Subgroup B3 Metallo-β-Lactamases | Journal of Molecular Evolution

gbrooks9 · October 21, 2017, 9:31pm

@agauger,

You wonder why I keep harping on “evolving” vs. “devolving”. It is because it is a bogus foolish distinction.

As to your comment about “Neutral Drift” containing no new information… Most definitions of Drift involve changes in ratios of alleles, while most definitions of Evolution include changes in the ratio of alleles as part of Evolution (not just new mutations in a population).

Since you originally stated “…it might not be an increase in information if all that happened is the enhancement of a minor activity already present…” I don’t really see how you can “enhance” an activity by changing the ratio of alleles… but without changing information.

Care to re-state your initial phrase?