Adam, Eve, and human population genetics, part 10: addressing critics—Poythress, chimpanzees, and DNA identity (continued) | The BioLogos Forum

system · April 8, 2015, 2:36pm

Note: In this series, we explore the genetic evidence that indicates humans became a separate species as a substantial population, rather than descending uniquely from an ancestral pair.

In yesterday’s post, we began to explore the arguments of Dr. Vern Poythress in his recent book Did Adam Exist? – specifically, his claim that the true level of identity between the human and chimpanzee genomes is on the order of 70%, rather than 96-99%. The first source that Poythress cites in support of his claim is a 2002 study comparing human and chimpanzee sequences:

“The 96 percent figure deals only with DNA regions for which an alignment or partially matching sequence can be found. It turns out that not all the regions of human DNA align with chimp DNA. A technical article in 2002 reported that 28 percent of the total DNA had to be excluded because of alignment problems, and that ‘for 7% of the chimpanzee sequences, no region with similarity could be detected in the human genome.’”

There are several problems here that would not, unfortunately, be apparent to Poythress’s intended audience. These problems, however, are immediately apparent to a geneticist. The first issue is that this paper, published as it was in 2002, cannot be a study comparing the entire human genome with the entire chimpanzee genome. In 2002, only a preliminary draft of the human genome was available (an improved version would be released in 2004). Moreover, chimpanzee genomics was in its infancy in 2002. Consulting the paper itself reveals that the sample size was a tiny fraction of the chimpanzee genome. The first line of the paper’s abstract indicates that the analysis was restricted to a small amount of DNA:

A total of 8,859 DNA sequences encompassing approximately 1.9 million base pairs of the chimpanzee genome were sequenced and compared to corresponding human DNA sequences.

Given that the human and chimpanzee genomes are about 3 billion base pairs long, this paper is describing the results for comparing approximately 0.06% of the two genomes. This 0.06% sample was drawn out of a larger sample of about 3 million base pairs, as the authors describe:

Twenty-eight percent of the total amount of sequence was excluded from the analysis, since the entire sequence, or parts of it, displayed more than one match in the human genome that was not due to known families of repeated sequences. For 7% of the chimpanzee sequences, no region with similarity could be detected in the human genome.

This is the section that Poythress is discussing when he states that “28 percent of the total DNA had to be excluded because of alignment problems”. First, note that the “total DNA” here is the 3 million DNA base pairs of the sample under study, not the total 3 billion DNA base pairs of the entire chimpanzee genome. Second, note that the reason for excluding the sequence from the analysis is not because it does not match the human genome, but because it matches it in more than one location. Genomes have a lot of repetitive DNA, and this is part of the challenge when comparing genomes.
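
As a rough check of these proportions, the arithmetic can be sketched in a few lines of Python. The constant and function names are illustrative only, and the genome size is the round 3 billion base pair figure used in this post, not a measured value.

```python
# Figures quoted in the post; GENOME_BP is the post's round number.
GENOME_BP = 3_000_000_000
ANALYZED_BP = 1_900_000   # sequence actually compared in the 2002 study
SAMPLED_BP = 3_000_000    # approximate total sample before exclusions


def percent_of_genome(bp: float, genome_bp: int = GENOME_BP) -> float:
    """Return bp expressed as a percentage of the whole genome."""
    return 100 * bp / genome_bp


print(f"analyzed: {percent_of_genome(ANALYZED_BP):.2f}% of the genome")
print(f"sampled:  {percent_of_genome(SAMPLED_BP):.2f}% of the genome")
# The excluded 28% of the sample, expressed as a share of the genome:
print(f"excluded: {percent_of_genome(0.28 * SAMPLED_BP):.3f}% of the genome")
```

On these figures, the excluded 28% of the sample corresponds to only a few hundredths of a percent of the genome.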

Perhaps a short aside about how genomes are sequenced would be helpful here. An analogy I have used before is to imagine a genome as a long text. The way scientists “read” a text (genome) is by chopping it up into fragments, reading the fragments, and then reconstructing the text by finding where the fragments overlap. This process works well until one encounters repeated text. For example, consider the opening lines of A Tale of Two Cities by Charles Dickens:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair…

As I have written before on this topic, we can reassemble a text with repetitive sequence if we use fragments that are long enough:

If we were to break multiple copies of the original paragraph into random short fragments of a few words each, we could in principle reassemble the entire piece from overlapping segments in the fragments. Where we would run into problems, however, would be with short fragments that are repeated. For example, if we had a fragment that read “it was the” we could not be sure where to place it, since it could match any one of nine locations. The only way to resolve this is to find larger fragments, such as “it was the season of” – which now matches one of only two locations. Better still would be “it was the season of Darkness” which aligns uniquely to only one location.
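
The fragment-placement problem can be illustrated with a minimal Python sketch. It assumes a case-sensitive exact match against the opening clauses of A Tale of Two Cities; `match_positions` is a hypothetical helper written for this illustration, not part of any real alignment tool.

```python
OPENING = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness, "
    "it was the epoch of belief, it was the epoch of incredulity, "
    "it was the season of Light, it was the season of Darkness, "
    "it was the spring of hope, it was the winter of despair,"
)


def match_positions(fragment: str, text: str) -> list[int]:
    """Return every start index where fragment occurs in text (case-sensitive)."""
    positions = []
    start = text.find(fragment)
    while start != -1:
        positions.append(start)
        start = text.find(fragment, start + 1)
    return positions


for fragment in (
    "it was the",
    "it was the season of",
    "it was the season of Darkness",
):
    count = len(match_positions(fragment, OPENING))
    print(f"{fragment!r}: {count} location(s)")
```

The short fragment matches nine places, the medium one two, and the longest exactly one, which is why longer fragments are easier to place unambiguously.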

What Poythress seems to be misunderstanding is that the reason for excluding 28% of the sequence in the 2002 study was because it was the genomic equivalent of an “it was the” sentence fragment. These chimpanzee sequence fragments match the human genome – and may even match it perfectly – but they match in many places. The 2002 study, as an early study, used very short DNA fragments for its analysis. The fact that 28% of those fragments matched more than one location in the human genome is not at all surprising, and is not at all an indication that 28% of the chimpanzee sample is completely unlike human DNA. And even if it were (which it is not), the sample in question is only about one thousandth of the size of the human and chimpanzee genomes – a mere 0.1% of the total genome size.

What, then, of the 7% that the authors could not match to the human genome? Here Poythress might have found an argument, except that, once again, this is not 7% of the entire chimpanzee genome, but 7% of one thousandth of the chimpanzee genome. Moreover, since this analysis was performed in 2002 – thirteen years ago – it was done using a draft of the human genome that is far, far inferior to what we have in the present day. As such, it is highly likely that many of the excluded sequences do in fact have a match in the human genome, but failed to find a match in the 2002 draft.

Of course, the way to approach this question is to look at larger data sets, and use the most recent data available. When one does so, one finds that the human and chimpanzee genomes are indeed about 95% identical, genome wide – data that Poythress does not discuss, or even mention.

(As an aside, attempts to minimize the identity between the human and chimpanzee genomes are common among Christians who deny evolution. I have written extensively on this topic in the past with respect to the Discovery Institute and Reasons to Believe (PDF), for example – and interested readers will find a much more thorough discussion in those sources. Interestingly, my friend and colleague Todd Wood – a young-earth creationist (YEC) – has also expended significant effort to combat these misunderstandings among those holding to anti-evolutionary views. He wrote a seminal paper in the creationist literature on the topic in 2006, and has also strongly critiqued Reasons to Believe on these issues. I have found Todd’s scholarship entirely trustworthy and a fascinating read, given his YEC views.)

As such, Poythress’s argument that human and chimpanzee DNA is only about 70% identical has not yet found scientific support. In the next post in this series, we’ll examine his second line of argument – that a large percentage of human DNA is a better match to other great apes rather than to chimpanzees. Here too we will see that Poythress fails to understand the relevant science, and that the evidence does not support his conclusions.

This is a companion discussion topic for the original entry at https://biologos.org/blog/adam-eve-and-human-population-genetics-part-10-addressing-criticspoythress

dcscccc · April 9, 2015, 9:49am

About the difference between chimp and human: even if it is only 1%, that is a lot, because it means a difference of 30,000,000 bp in the genome. Humans also have unique proteins, and even one protein can take billions of years to evolve. The sequence space for a small protein is about 20^100, so even if there are about 10^100 functional sequences, that is nothing. So we have evidence that chimps and humans did not split 6 million years ago.

PGarrison · April 9, 2015, 1:54pm

Just a bit of clarification. It probably sounds strange to many to say “percentage identity.” In ordinary usage, two things are either identical or they aren’t. It’s molecular biologist shorthand for “the percentage of bases (or amino acids in proteins) that are identical in an optimum alignment of the two sequences.” In order to determine an optimum alignment, you have to have a scoring scheme that penalizes insertions and deletions, with the penalty getting larger as the indel gets bigger. With unlimited indels you can align any two sequences perfectly by just moving the next base until you hit an identical one in the other sequence. What you get would be meaningless. You can also tell the program to only count matches of 2, 3 or more bases in a row. The exact alignment you get depends on how you set these parameters. For proteins you try to use scoring schemes that reflect the particularities of protein structure. DNA sequences are simpler in many respects since there are only 4 bases, and effects on DNA structure are not so constraining.

There are alignment algorithms that, given a particular scoring scheme, have been proven by mathematicians to always find the best alignment. They are too slow for use on genome sized alignments, so fast algorithms have been developed to deal with these huge sequences. I found in practice that even supposedly perfect algorithms would occasionally miss something that was obvious when you looked at the sequence.
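
To make the idea of a scoring scheme concrete, here is a minimal sketch of one textbook exact method, Needleman-Wunsch global alignment, followed by a percent identity calculation. It is an illustration under stated assumptions, not the method used in any study discussed here: the scores (match +1, mismatch -1, gap -2 per position), the function names, and the test sequences are all chosen for the example, and the gap penalty is simply linear in indel length.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns the two aligned strings, using '-' for indel positions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two characters
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of alignment length."""
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100 * matches / len(aligned_a)


top, bottom = global_align("GATTACA", "GATCA")
print(top)
print(bottom)
print(f"{percent_identity(top, bottom):.1f}% identical")
```

Setting `gap=0` in this sketch reproduces the degenerate case described above: gaps become free, so no mismatch columns remain in the alignment and the result says very little about how similar the sequences are.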

PGarrison · April 9, 2015, 2:01pm

dcscccc, actually there are quite a few instances where the chimp and human proteins are identical. I don’t remember the exact number. There are even cases where this is true between a mouse protein and a human protein.

It would help if you focused on one problem at a time. You’re mixing up the arguments about initial formation of genes in primitive cells a couple of billion years ago with the question of whether humans and chimps share a common ancestor. By any standard chimp and human proteins and genes are very similar. It’s a lot more likely that that reflects a common ancestor than that nearly identical genes somehow came about independently.

johnZ · April 9, 2015, 5:49pm

This may not be a good place to bring up something from part 1 or part 2 of this series, but I didn’t look at them before. However, today I looked at an illustration that was presented of similarity to language transition from Old English (West Saxon) to Modern English. In it was stated that changes were gradual and slow… but the truth is that 1066 was listed as the termination of Old English, when the Norman influence greatly changed the language. Along the way, it was not slow internal language changes that made the spoken language different. Even Old English, being brought in primarily from the Germanic countries, consisted of four main dialects, not just one. And the language of the country was changed along the way by outside influences, such as the Normans (Vikings), and then the French (1066, William the Conqueror). So the analogy would be good in illustrating how languages continually change due to the influence of other peoples and lands coming into contact with a people, just as today we have languages influencing each other. And just as today we find that we are able to learn other languages, so also we know all groups of humans are able to interbreed with each other, even though some groups have been separated for a long time and have some distinctive differences in appearance.

johnZ · April 9, 2015, 7:14pm

Dennis, thank you for part 10 of this series; there are some good explanations in it. I don’t think addressing Poythress is the most legitimate way of dealing with scientific challenges, since challenges have been posed by actual scientists about the way some of this information is presented… it would be better to respond to the real challenges.

The change from a supposed 99% similarity to the 95% similarity that is now indicated (and the literature demonstrates reductions in the similarity from the first 99% down to 97.5% to 96% and now to 95 or even 94%) is indeed revealing, and it also has revealed the selectivity of what is included as differences or not. While the 28% unaligned sections were indeed debatable as to a real difference, it is clear that there are 35 million nucleotide differences, as well as 90 Mb of indels.

The genome size difference between chimp and human has been estimated at 7-8% (larger for chimp), and between overall size difference and the sequence and indel differences total up to about 87% similarity, or 13% differences, disregarding a baseline for necessary similarity of all eukaryotic organisms.

While you may be right that there are motivations to minimize the identity similarities, it is as significant that there are motivations to maximize the identity similarities. This by itself is entirely irrelevant, unless it leads to misleading reporting.

Humeandroid · April 9, 2015, 10:31pm

johnZ:

The genome size difference between chimp and human has been estimated at 7-8% (larger for chimp), and between overall size difference and the sequence and indel differences total up to about 87% similarity, or 13% differences, disregarding a baseline for necessary similarity of all eukaryotic organisms.

Do you have references for any of these claims? They sound wrong to me, and contradict what has been published quite recently (see paper I cited in yesterday’s discussion, for example). The 2005 chimp genome paper reports that the chimp genome seems to have grown by about 16Mb since the split. If a haploid chimp genome is about 3200Mb, that would seem to be a difference of 0.5%. I am trying to understand why anyone would have claimed that the two genomes differ in size by 7-8%. Can you explain?

johnZ · April 9, 2015, 11:33pm

Maybe I missed it David, but I did not see any estimation of total genome size in the paper you cited. But I did see this: “Taken together these studies establish that there are far more genomic differences between human and other primate genomes than was originally thought11. Nevertheless it is difficult to precisely estimate what fraction of the genome contains HLS sequences. While the human genome is the most complete and accurate mammalian genome sequence currently available, it still contains many sequence gaps in complex genomic regions, which may harbor important HLS genes49. Confounded by the far more incomplete nature of all other primate genomes, there are likely many HLS changes that have yet to be discovered.”

Genome sizes for humans have been reported from 3.15 billion bp to 3.5 billion bp. Genome sizes for chimps have been reported from 3.46 to 3.85 billion bp. So estimates for humans are now at 3.2 and for chimps somewhat larger. At even 0.2 billion bp difference, that’s about 7%.

DennisVenema · April 10, 2015, 12:06am

Hi John,

The 99% value was never reported as a genome-wide value. That value has meaning, because it shows us how quickly DNA changes through single-letter substitutions. DNA can change more rapidly with insertions and deletions (indels) - so that is why biologists are also interested in the overall identity value that counts indels. Biologists know about this, and they report it in the literature - it’s not a secret or a conspiracy against Christians.

The current estimate for the size of the human genome is 3,096,649,726 and the chimp genome is 3,309,577,922 base pairs (as of today on Ensembl). That makes the human genome 93.6% of the size of the chimp genome. If those values hold, then the overall identity estimate may drift down towards 92-93%, but it’s hard to say. Don’t forget that current indel estimates (95%) already include a very, very large portion of that size difference - we’re talking about the small regions that remain to be aligned and sequenced. It’s a mistake, as you have done, to add the length difference to the indel difference - they are pretty much the same sequences, except for the tiny bit we have yet to align. The draft chimpanzee genome we have today, at 6x coverage, may also change (in estimated length, as well as in sequence) meaning that it’s possible that the identity number may rise or fall a tiny amount as we revise the chimp genome also.
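
The size ratio quoted above can be checked with a short Python sketch. The two lengths are the figures given in this comment; the variable names are illustrative only.

```python
# Genome lengths in base pairs, as quoted in this comment.
HUMAN_BP = 3_096_649_726
CHIMP_BP = 3_309_577_922

ratio = 100 * HUMAN_BP / CHIMP_BP
difference_bp = CHIMP_BP - HUMAN_BP

print(f"human length as a share of chimp length: {ratio:.1f}%")
print(f"length difference: {difference_bp:,} bp")
```

This only compares total lengths. As noted above, most of that length difference is already counted as indels in the 95% figure, so it should not be added on top of it.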

All that to say that the older estimates (95%) were remarkably close, and at present are the best published values we have.

Keep in mind that this is just one chimp, and an “average” human. The identity value for comparisons of particular human / chimp individuals will have a range, since there is variation in both populations.

But, all of this is somewhat angels on the head of a pin… the point here is that Poythress is hugely wide of the mark (70%), and seemingly unaware that he is badly misleading folks. If his paper had argued for an overall value of 92% (with a thoughtful discussion of current data) I wouldn’t be critiquing him.

Also, don’t forget that counting every indel as if it were a series of unique mutations (one for every DNA letter removed or added) is biased, since single mutation events can remove or add hundreds or thousands of DNA letters at a time.

For example, as I mentioned on the other thread, compare similar sentences:

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the was the

It was the test of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
it was the

If one mutation (a duplication or a single substitution) can effect either change, making one sequence 99% identical and the other 94% identical, is it reasonable to say that these sequences are in fact 1% and 6% different from before? Yes in one sense, and no in another. The indel, for example, matches exactly to the original sequence, just twice. Is it really completely different? That’s how the identity measure scores it. That’s the challenge when comparing indels and substitutions.
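
The two percentages can be reproduced with a small Python sketch. It assumes a crude identity score, the number of characters in the longest common subsequence divided by the length of the longer text, and joins the five lines with single spaces. The helper names are illustrative, and real genome comparisons use far more elaborate alignment and scoring.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def crude_identity(a: str, b: str) -> float:
    """Matched characters as a percentage of the longer text."""
    return 100 * lcs_length(a, b) / max(len(a), len(b))


original = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness, it was the"
)
duplication = original + " was the"                  # one duplication event
substitution = original.replace("best", "test", 1)   # one substitution event

print(f"substitution: {crude_identity(original, substitution):.0f}% identical")
print(f"duplication:  {crude_identity(original, duplication):.0f}% identical")
```

Each variant differs from the original by a single event, yet the substitution scores about 99% and the duplication about 94%, because the score counts every added letter rather than every event.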

DennisVenema · April 10, 2015, 12:59am

Another thing I forgot to mention is that the Y chromosome is a significant part of the overall differences between humans and chimps - the sequence of the Y is quite divergent. Now, is the Y chromosome part of the human genome? Yes, (for males). So, if we exclude the Y, the identity value would rise.

BradKramer · April 10, 2015, 2:15pm

I moved 5 posts to a new topic: Should Evolutionary Creationists use the word “design”?

johnZ · April 10, 2015, 4:22am

I must agree with you Dennis, that I do not believe there is a conscious conspiracy by biologists against Christians. I’m sure they also report in the literature what they are doing. However, abstracts sometimes only cover a small part of their work, and the way abstracts are sometimes written does not give a complete picture of what the research actually did. This is understandable, but in a volatile and sensational scientific area, anything dealing with human beings, origins, health, genetics is as likely to be misinterpreted as it is to be interpreted correctly by the public. When biologists say they have found 99% similarity in the genome, as they were saying in 2002, they could just as legitimately have said that so far they had found similarity in .06% of the genome, and didn’t know about the rest. The general conclusion was drawn subconsciously before the evidence, the entire genome, was compared. So 99% similar became the headline of the day, relatively uncontested by scientists, and unqualified by the media. Eventually qualifications became generally known, eventually the indels were included, and the number began to be reduced, eventually leading to a 96.5%, and then to 95% similarity. The point of this is not to debate endlessly about the percentages, although it is good to know what is involved, and how different portions and aspects of the DNA are impacted. And thanks for this, because I am learning some things as we go along. But the point is whether the bias influences how the facts are reported; I don’t mean they are reported falsely. As an example, one person may say that two big white four door cars are virtually identical, because they appear to be. A second person will say that they are completely different because they have different mileage, different engine types, are built in different years, have different color interiors, and one has electric leather seats, and GPS, while the other has cloth seats. The first perspective is not as scientific as the second, even though it is not incorrect.

Comparisons made to one chimp, or to one neandertal would not generally be considered scientific in terms of general conclusions, even though the process might be valid for the individual. However, I have seen at least four chimp genome measurements reported. (Perhaps they were not all compared to human genome.) Whether a single mutation event can remove thousands of letters or only a few, is not really the issue in measuring differences, since the differences are not measured by mutation events, but by differences in the genome, regardless of how many events it took to make them.

However, your point that indels may already be counted in the genome size difference is interesting, and if accurate, then would be valid in terms of not double counting differences. It would seem though, based on the way changes were reported, that indels accounted for about 3-4% difference (going from 99% to 95% similarity). Thus with a 7% difference in size, another 4% of the size difference is caused by something else, still leading to 91% similarity. However, not all insertions would be allocated to the chimp which has the larger genome; many would be allocated to the human genome, which is smaller and these cannot increase the size difference. So if half the insertions which show a difference apply to the chimp, then that only accounts for about 2% of the difference in genome size. This would drop the similarity to about 89%, if my math holds up. Furthermore, indels consist of both insertions and deletions; can you tell me what proportion of each makes up the 4% difference in genome comparison caused by indels?

When comparing genomes, anything that looks different is different. It doesn’t matter if it is a repeat. It doesn’t matter what the apparent function might be, or whether the geneticist even thinks it has a function or not. What the differences actually do is an entirely different issue, very significant, but still different. The two things should not be conflated. Since the same four bases can only perform their functions based on the information determined by order of location, then a repeat may be as significant as not. The original assumption in genetics was that a large part of the DNA was junk DNA, serving no function. This has been discovered to be in error so that most of the DNA is now thought to be important and functional. (This is somewhat similar to the presumptions about vestigial organs, which were eventually found not to be vestigial after all.)

Regardless of the similarity or dissimilarity of the genomes, the aspect of the significance of the genome differences could still be measured differently. For example, many small differences in muscle mass or bone length or hair length may accrue to many genome differences, yet overall these differences do not appear to be so definitively significant, compared to the general variation in both populations. On the other hand, one significant genetic difference could account for a massive physiological difference such as an opposing thumb or the ability to speak or brain size. A different measure would have to be used to measure the practical significance of the differences, and likely it would be almost impossible or at least very difficult and perhaps somewhat subjective to attach any mathematics to this type of comparison…

Joao · April 10, 2015, 7:42am

How many? I’m curious.

If you don’t know, please hypothesize. How many human proteins would you predict have no ortholog in the chimp, and vice versa?

Humeandroid · April 11, 2015, 5:02pm

Hi Dennis, I think you are misinterpreting the data on Ensembl. You are quoting lengths from the “Golden Path.” I don’t think these numbers represent genome size. The “Golden Path” represents a set of contigs/clones used to construct the assembly. If I’m not mistaken, it is an early approximation of genome size but can include overlap due to intentional overlap between clones. It is not, I think, the most accurate estimate of genome size, and is not intended to be.

The current Golden Path length for the single chimp assembly is quite big, and seems clearly to be an overestimate of chimp genome size. To get the best estimate for genome size, look at “Total length” for the assembly. That number is currently 3,309,561,368 for chimp and 3,226,010,022 for human. The numbers should be the same at Ensembl. We know that the numbers are likely to change a little, but not to the extent quoted by johnZ and by you. I can find no claim about a significantly larger chimp genome in the literature, although the crude estimates in the Animal Genome Size Database suggest that the chimp genome could be somewhat bigger than the human. I doubt this, though, because the larger estimates are much older–the most recent flow cytometric data puts chimp DNA content equal to human.

Quick note to johnZ: the paper I cited in my previous comment is not about genome size. The data suggesting slight expansion of the chimp genome since the split comes from the 2005 paper, as I indicated. I remain curious about your sources for your claims.

Humeandroid · April 11, 2015, 5:14pm

Er, I see that the 3.3Gb estimate for chimp genome size is not “high” as I suggested, since the total length currently on both sites I linked to are the same as the “Golden Path” which seems not to be an accurate measurement of genome size. I was trying to get to the point that there is no indication that the two genomes differ in size by more than a trifle, and certainly not by 7-8%. It does seem that there is uncertainty of various scales due to vexation from things like segmental duplications.

dcscccc · April 11, 2015, 7:54pm

hi joao,

This paper claims about 60 de novo proteins:

We know at least 3 of them for sure.

johnZ · April 11, 2015, 11:16pm

David, I don’t want to say that I cannot find a way to really get at this issue based on what you are saying. You say that the “Golden Path” does not “seem” to be an accurate measurement of genome size. “Seem” to be? Are you equating bp to Gb? You say the 3.3 Gb estimate is similar for two sites you checked, as well as your “golden path”, but now this is not an indication? Numbers, numbers, numbers, like statistics… you can get them to say whatever you want them to say… why do people get upset then with YEC using the numbers available, or with media reporting of numbers… Okay, enough of that rant. Forget I said that.

I don’t create numbers; I only used what is reported. If genome sizes have been revised to the Ensembl numbers, then the genome difference is 0.2 billion bp, which is about 6.5% difference. If golden numbers are used, then about 3% difference. Which is right? Who knows. But for now I will go with the Ensembl numbers. For the rest, I presently stand by my comments on the interplay between indels and genome size.

Joao · April 12, 2015, 7:49am

I would say instead that the evidence provided in the paper suggests that there are about 60. It’s not helpful to present evidence as something as vague as a mere claim.

So, was 60 in the range you predicted? Do you have a hypothesis about evolution or creation that could be tested by digging more deeply into this evidence?

dcscccc · April 12, 2015, 11:10am

Hi Joao, actually I did not make any prediction about unique proteins in humans. I am just saying that even one such protein disproves the claim that the human-chimp split happened 6 million years ago. Even 1 billion years would not be enough. The sequence space for a protein of 100 amino acids is 20^100. So let’s say that the number of functional sequences in this space is about the number of atoms in the universe (10^70). It still means that we would need something like 10^60 mutations to get this protein.
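
The arithmetic in this comment can be checked with a short Python sketch. It takes the comment’s own figures and its implied model (the number of sequences in the space divided by the number assumed to be functional) at face value; whether that model describes how proteins actually arise is a separate question. The variable names are illustrative only.

```python
import math

SEQUENCE_SPACE = 20 ** 100   # all possible proteins of 100 amino acids
FUNCTIONAL = 10 ** 70        # the comment's assumed count of functional sequences

log10_space = math.log10(SEQUENCE_SPACE)
log10_ratio = log10_space - math.log10(FUNCTIONAL)

print(f"sequence space is about 10^{log10_space:.1f}")
print(f"space / functional is about 10^{log10_ratio:.1f}")
```

So 20^100 is roughly 10^130, and dividing by 10^70 gives roughly 10^60, which matches the figure quoted in the comment.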

PGarrison · April 12, 2015, 11:36am

There used to be a quotation from one of the early molecular biologists commonly posted in labs. “If it’s not worth doing, it’s not worth doing well.” (In other words, don’t waste your time on something that isn’t important - doing it well doesn’t make it important.) I think it applies to the discussions about overall percentage identities of genomes in the context of disputes about common descent. As much as this may seem like an obvious topic to the non-geneticist, there is no basis for setting a threshold of identity of genomes above which you must conclude common descent and below which you don’t. The accepted view in biology is that we have a common ancestor with yeast or bacteria, where you could only measure significant identity between common proteins or very conserved RNAs like rRNA.

The real arguments for common descent of primates (or other groups) are to be made based on finer analysis of the evidence of shared complex mutations or patterns of mutation in orthologous locations in the genomes. Arguments about percentage identity of genomes are pointless, since no matter what the real percent identity is, nothing can be concluded from it.

The only relevant point is that there is a high degree of identity between humans and other primates in regions of the genome where everything we know suggests strongly that the sequence has no effect on function. Common descent, recently enough that these sequences have not been randomized by mutation, is the obvious explanation for the high degree of identity. However, the reasons for thinking these regions have no effect on function are rather difficult to explain, so this is commonly disputed by anti-evolutionists. The clearer and simpler arguments for common descent are based on the more detailed analyses referred to above, which Dennis and others have presented here.