Regarding the 10,000bp region in the Zhao et al. paper you say this:
To see if you are correct I have downloaded Zhao et al.’s sequences from GenBank, aligned them, and taken a look at the variation. Here is a very simple portrayal of the variants that are present in more than two individuals.
Each column is a locus with variation (with only the loci where the minor allele is present in more than two individuals being shown). The number at the top of each column is the position within the 10,000bp sequence. Each row is an individual human. I have coloured in calls that are unambiguously A, T, G or C.
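For anyone who wants to reproduce the filtering step, here is a minimal sketch of how such columns could be picked out of an alignment. It is not the script used to make the figure. The function name `informative_sites`, the `min_carriers` parameter and the toy sequences are all made up for illustration, and it assumes each aligned sequence stands for one individual and that ambiguous calls are simply ignored when counting alleles.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")

def informative_sites(alignment, min_carriers=3):
    """Return 1-based positions where a non-major allele is seen
    in at least `min_carriers` sequences (hypothetical helper)."""
    sequences = list(alignment.values())
    length = len(sequences[0])
    sites = []
    for i in range(length):
        # Count only unambiguous calls (A, C, G, T) in this column.
        counts = Counter(
            seq[i].upper() for seq in sequences if seq[i].upper() in UNAMBIGUOUS
        )
        ranked = counts.most_common()
        # The second most common base is the minor allele.
        if len(ranked) >= 2 and ranked[1][1] >= min_carriers:
            sites.append(i + 1)
    return sites

# Toy alignment, not real data: six individuals, four positions.
toy = {
    "ind1": "ACGT",
    "ind2": "ACGT",
    "ind3": "ACGT",
    "ind4": "ATGA",
    "ind5": "ATGA",
    "ind6": "ATGR",
}
print(informative_sites(toy))
```

With the default `min_carriers=3`, a column is kept only when the minor allele is seen in more than two sequences, which matches the threshold described above.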
Many of the calls in the data are ambiguous, which means that either the individual was heterozygous at that locus, or the sequence data was poor there. Where there are several ambiguous sites in an individual it is not easy to infer the haplotypes (combinations in which the variants are found within the two parental genomes within the individual).
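To give a feel for why several ambiguous sites make haplotypes hard to infer, here is a small sketch. It assumes, purely for illustration, that every two-base IUPAC ambiguity code is a genuine heterozygous call rather than poor sequence, which the data alone cannot confirm. Under that assumption an individual with k heterozygous sites has 2^(k-1) possible pairs of haplotypes. The function names are hypothetical.

```python
# Standard IUPAC codes that stand for exactly two bases.
TWO_BASE_CODES = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}

def count_two_base_codes(seq):
    """Count sites carrying a two-base ambiguity code."""
    return sum(base.upper() in TWO_BASE_CODES for base in seq)

def possible_phasings(n_het):
    """Number of distinct haplotype pairs consistent with
    n_het heterozygous biallelic sites."""
    return 1 if n_het == 0 else 2 ** (n_het - 1)

# Toy sequence, not real data: R and Y are counted, N is not.
k = count_two_base_codes("ACRTYN")
print(k, possible_phasings(k))
```

One ambiguous site leaves a single possible pair of haplotypes, but four such sites already allow eight, which is why the individuals with many ambiguity codes are set aside.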
In the figure I have taken the individuals that had few ambiguity codes, and divided their haplotypes into four groups, where each haplotype group contains differences of up to three mutations. I have labelled these in the second column, and put a spacing row between them. I have also labelled a couple of sequences as R in the second column, as these may be recombinants. Then at the bottom I have placed individuals that have lots of ambiguous bases, making their haplotypes hard to call.
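The grouping in the figure was done by eye, but the same idea can be sketched in code. The following is one possible approach and not the method behind the figure: a greedy pass that puts a haplotype into the first existing group in which it differs from every member by at most `max_diff` positions, and otherwise starts a new group. The names `hamming` and `group_haplotypes` and the toy sequences are assumptions for illustration, and the result of a greedy pass like this can depend on the order of the input.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def group_haplotypes(haplotypes, max_diff=3):
    """Greedily group haplotypes so that all members of a group
    differ from each other by at most `max_diff` positions."""
    groups = []
    for name, seq in haplotypes.items():
        for group in groups:
            if all(hamming(seq, haplotypes[m]) <= max_diff for m in group):
                group.append(name)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([name])
    return groups

# Toy haplotypes over eight variable sites, not real data.
toy_haplotypes = {
    "a": "AAAAAAAA",
    "b": "AAAAAAAT",
    "c": "TTTTTTTT",
    "d": "TTTTTTAA",
}
print(group_haplotypes(toy_haplotypes))
```

A haplotype that fits no group becomes a group of its own, so possible recombinants would show up as small groups that still need checking by eye.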
I think it is clear to the eye that the data can be divided fairly easily into four groups, which could correspond to small variations on four ancestral haplotypes. Given how similar they all are, the number of ancestral haplotypes could in fact be lower.
This is a very rough-and-ready analysis that I did partly on the London Underground on my way home from work this evening. I can follow up in more detail if you wish. As a preliminary analysis, I think it supports my point quite nicely.