Also, I would add that the Nelson studies greatly overestimate the number of ORFans and de novo genes.
I think this article by Larry Moran is helpful and well reasoned: Sandwalk: Origin of de novo genes in humans. He references this particular paper, which is a must-read (and was already pointed out by @Argon): Origins of De Novo Genes in Human and Chimpanzee. I would also add this study by Eric Lander as a must-read: http://www.pnas.org/content/104/49/19428.full.
In spite of what you might have read in the popular literature, there are not a large number of newly formed genes in most species. Genes that appear to be unique to a single species are called "orphan" genes. When a genome is first sequenced there will always be a large number of potential orphan genes because the gene prediction software tilts toward false positives in order to minimize false negatives. Further investigation and annotation reduces the number of potential genes…
They compared the five genomes to find examples that were only expressed in humans and/or chimpanzees but where similar nontranscribed sequences were present in macaque or macaque and mouse genomes…
The result was 634 human-specific transcribed regions, 780 that were chimpanzee-specific, and 1,300 that were only found in humans and chimps.
We want to know if these are real genes or just spurious transcripts. The first clue is that 94% of these transcribed regions are expressed in testes. … You expect more spurious transcription in testes cells.
They found one human-specific peptide and 6 hominoid-specific peptides by mass spectrometry. By looking at ribosome-associated RNAs they identified 5 additional human-specific and 10 hominoid-specific transcripts. Thus, there are 21 potential de novo protein-coding genes. The median size of the peptides is 76 amino acid residues.
They conclude, "… in de novo genes in general there was not a significant decrease in the number of substitutions in the longest ORF when compared to neutrally evolving sequences, suggesting that the majority of these transcripts do not encode functional protein."
"Our results indicate that the expression of new loci in the genome takes place at a very high rate and is probably mediated by random mutations that generate new active promoters. These newly expressed transcripts would form the substrate for the evolution of new genes with novel functions."
This is important because it shows us that generation of new genes from "random" sequences is not difficult.
Scientifically, this study is entirely consistent with neutral theory, and (in my opinion) entirely squashes the de novo protein argument against evolution.
In particular, IDists who want to argue that all these proteins are functional need to explain:
- Why are most of them in the testis? Is that really the essence of what makes humans different from chimps?
- Why are all the new proteins so short?
- Why are all the new proteins similar to non-coding regions and known transposable elements?
- Why is there no evidence that most of them are translated into proteins? (this is the mass-spec portion of the study).
- Why does the length distribution of the ORFans exactly match what we expect from neutral theory (http://www.pnas.org/content/104/49/19428.full)?
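To make the "short proteins" and "length distribution" questions concrete, here is a rough back-of-the-envelope sketch. It is not the method used in either cited paper. It assumes a deliberately simplified null model in which codons are drawn independently and uniformly, so 3 of the 64 codons are stops. The function names are mine, chosen only for illustration.

```python
# Simplified null model (an assumption for illustration, not taken from the
# cited papers): codons are independent and uniform, so 3 of 64 are stops.
P_STOP = 3 / 64


def prob_open_run_at_least(n_codons: int, p_stop: float = P_STOP) -> float:
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1.0 - p_stop) ** n_codons


def mean_open_run(p_stop: float = P_STOP) -> float:
    """Mean number of sense codons before the first stop (geometric law)."""
    return (1.0 - p_stop) / p_stop


if __name__ == "__main__":
    # 76 is the median peptide length quoted above; the rest are arbitrary.
    for n in (25, 76, 150, 300):
        print(f"P(open run >= {n} codons) = {prob_open_run_at_least(n):.4g}")
    print(f"mean open run = {mean_open_run():.1f} codons")
```

Under this toy model, open runs are typically a couple of dozen codons long, and the chance of staying open falls off geometrically with length. So short ORFs are exactly what random, non-coding sequence throws up in abundance, and long ones are rare. Real genomes have biased base composition, so the exact numbers will differ.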
To be clear, neutral theory makes qualitative and quantitative predictions about de novo genes that are validated entirely by this data. We need a very strong reason to reject this, and I just do not see it in @Paul_Nelson's work.