@T_aquaticus > How would you rule out random mutation experimentally in these scenarios?
My first impulse would be to run a simulation. Specify GC content, random mutation parameters, and see what happens.
My husband I did a crude simulation of the rarity of ORFs in different GC content random sequence. Specifically we asked how many sequences 900 bp or longer had overlapping ORFs. We were asking it in the context of nylonase, which has 3 (!) non-stop frames. GC rich sequences do much better at having ORFs at all :
According to the nylonase story, as told by Ohno and Venema and numerous others, a new ATG start codon was formed by the insertion of a T between an A and G, thus creating a new start codon after the original ATG, which shifted the reading frame for that sequence to that specified by the new ATG, and creating a completely different coding sequence and thus a new protein. Let us grant that scenario for the sake of argument. Normally such a shift would produce a new coding sequence that would be interrupted by stop codons, so the newly frameshifted protein would be truncated. Thus the only reason this frameshift hypothesis for nylonase is even remotely possible is because the sequence coding for nylonase is most unusual, and contains not one, not two, but three open frames Although frameshift mutations are ordinarily considered to be quite disruptive, at least in this case the putative brand new protein sequence would not terminate early due to stop codons.
My point? The first step to getting a new functional protein of any length from a frameshift is to avoid stop codons. The odds of a random coding sequence having an open alternate frame, without stops, are poor. As a consequence, if a protein does have an open frame in addition to its coding sequence, it’s worth paying attention to. And it so happens that nylonase does have more than one open frame. The DNA sequence above illustrates the six frames, numbering them frames 1 through 6. Using that convention, frames 1 and 3 are read from the sense strand. Both have no stop codons over the length of the gene in the sense direction. Frame 4 on the antisense direction has no stop codons either. Frame 1 is the coding frame that specifies the nylonase protein, otherwise known as the open reading frame (ORF). It is defined by the presence of both a start and stop codon. The other two frames have no start codons or stop codons, so I’ll call them non-stop frames (NSFs). They are frames 3 and 4.
The probability of a DNA sequence with an ORF on the sense strand and 2 NSFs is very small. Just exactly how small are the chances of avoiding a stop codon in three out of six frames? We set out to determine that by performing a numerical simulation using pseudorandom numbers to generate sequences at various levels of GC content. (By we I mean that my husband, Patrick Achey, who is an actuary, did the programming work, while I determined the parameters.) We chose to vary the GC content because sequences with a higher GC content have fewer stop codons. Remember, a stop codon always has an A and a T (TAA, TAG, and TGA are the stop codons) so having a sequence with a lower percentage of AT content will reduce the frequency of stop codons. Conversely, higher GC content makes the chances of avoiding stop codons and getting longer ORFs much greater, thus also increasing the chances of NSFs. The genomes of bacteria vary in their GC content, from less than 20 percent to as much as 75 percent, though the reason why is not known. One species of Flavobacterium has a genome with about 32 percent GC and 2400 genes — the precise values varies with the strain. The plasmid on which nylB resides is very different. It has 65 percent GC content. The gene encoding nylonase has an even higher 70 percent GC content, which is near the observed bacterial maximum of 75 percent.
We chose to use a target ORF size of 900 nucleotides (or 300 amino acids) because it is an average size for a functional protein. Nylonase is 392 amino acids long; the small domain of beta lactamase, the enzyme my colleague Doug Axe studied, is about 150 amino acids long. The median length for an E. coli protein is 278 amino acids; for humans, the median length is 375.
As expected, the simulation showed that the higher the GC content, the greater the likelihood that ORFs that are 900+ nucleotides long exist. At 50 percent GC, the average ORF length we obtained was about 60 nucleotides; most ORFs terminate well before 900 nucleotides. Indeed, in our simulation only two out of a million random sequences made it to 900 nucleotides before encountering a stop codon. As a result, we could not determine the rarity of NSFs at 50 percent GC — we would probably have to run the simulation for more than a billion trials to get any significant number of NSFs at all.
Sequences at 60 percent GC gave 57 ORFs at least 900 nucleotides long out of a million trials, while sequences at 65 percent GC produced 404 out of a million, one of which also had an NSF.
NSFs were much more probable for sequences that were 70 percent GC, like nylB. In our simulation 3,021 out of a million trials were ORFs at least 900 nucleotides long. That’s a frequency of .3 percent. Of those 3,021 ORFs, 86 had 1 NSF, and none had 2 NSFs. We had to run 10 million trials at 70 percent GC to see any ORFs with 2 NSFs. From those 10 million randomly generated sequences, we obtained 28,603 ORFs; 903 had 1 NSF and only 9 had 2 NSFs.
Interestingly, at 80 percent GC we got a few sequences with 4 NSFs; but I don’t know of any bacterium with a GC content that high.
Our simulation shows that multiple NSFs are very rare. The probability that an ORF 900 nucleotides long with 70 percent GC content will have two NSFs is 9 out of 28,603, or 0.0003. If these figures are recast to include the total number of trials required to get an ORF of that length and GC content and with 2 NSFs, the probability would be 9 out of 10,000,000 trials.
A sequence like nylB is very rare. In fact, I suspect that for all cases where overlapping genes exist, in other words where alternate frames from the same sequence have the potential to code for different proteins, unusual sequence will necessarily be found. Likely it will be high in GC content. Could such rare sequences be accidental? I think that if we compare the expected number of alternate or overlapping NSFs per ORF, with the actual number we will find that there are more of these alternate open reading frames than would be predicted by chance.
From another study of overlapping genes:
Thus, bacterial genomes contain a larger number of long shadow ORFs [ORFs on alternate frames] than expected based on statistical analysis. Random mutational drift would have eliminated the signal long ago, if no selection pressures were stabilizing shadow ORFs. Deviations between the statistical model and bacterial genomes directly call for a functional explanation, since selection is the only force known to stabilize the depletion of stop codons. Most shadow genes have escaped discovery, as they are dismissed as false positives in most genome annotation programs. This is in sharp contrast to many embedded overlapping genes that have been discovered in bacteriophages. Since phages reside in a long term evolutionary equilibrium with the bacterial host genome, we suggest that overlooked shadow genes also exist in bacterial genomes.
Indeed, a study of the pOAD2 plasmid from which nylB came indicates that there are potentially many overlapping genes on that plasmid. nylB′, for example, a homologous gene on the same plasmid that differs by 47 amino acids from nylB, also has 2 NSFs. These unusual and unexpected features of DNA have consequences for how we think about the origin of information in DNA sequences, as I shall discuss in the next post.