Hi all,

I have been doing a bit more reading about the theoretical background of some of the methods we have been discussing here. I am not a mathematician, so much of this is outside my area of expertise. However, I have come across three papers that seem to suggest that site frequency spectra (as presented earlier in this discussion) have severe limitations as a source of evidence about past population sizes. The second of these papers specifically examines scenarios of a bottleneck followed by exponential population growth.

Simon Myers, Charles Fefferman, Nick Patterson **Can one learn history from the allelic spectrum?** Theoretical Population Biology, Volume 73, Issue 3, 2008, pp. 342-348

https://www.sciencedirect.com/science/article/pii/S0040580908000038

Abstract: It is well known that the neutral allelic frequency spectrum of a population is affected by the history of population size. A number of authors have used this fact to infer history given observed allele frequency data. We ask whether perfect information concerning the spectrum allows precise recovery of the history, and with an explicit example show that the answer is in the negative. This implies some limitations on how informative allelic spectra can be.

Terhorst, Jonathan, and Yun S. Song. **Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum**. Proceedings of the National Academy of Sciences 112.25 (2015): 7677-7682.

http://www.pnas.org/content/112/25/7677.short

Abstract: The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic that is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, little is currently known about the information theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least O(1/log s), where s is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number s of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.
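
To get a feel for what "exponentially worse" means in that abstract, here is a rough sketch of my own (not from the paper). It compares the 1/log s scaling with a 1/sqrt(s) rate, which I am assuming as a stand-in for the "classical" convergence rates the authors mention. All constants are ignored, so only the trend is meaningful, not the absolute numbers.

```python
import math


def classical_rate(s):
    """Assumed comparator: the usual 1/sqrt(s) parametric rate, constants ignored."""
    return 1.0 / math.sqrt(s)


def sfs_bound_rate(s):
    """The 1/log(s) scaling quoted in the abstract, constants ignored."""
    return 1.0 / math.log(s)


def sites_to_halve_classical(s):
    """Sites needed to halve a 1/sqrt(s) error: sqrt(4s) = 2*sqrt(s)."""
    return 4 * s


def sites_to_halve_sfs_bound(s):
    """Sites needed to halve a 1/log(s) bound: log(s**2) = 2*log(s)."""
    return s ** 2


if __name__ == "__main__":
    for exponent in (3, 5, 7, 9):
        s = 10 ** exponent
        print(
            f"s = 10^{exponent}: "
            f"1/sqrt(s) = {classical_rate(s):.2e}, "
            f"1/log(s) = {sfs_bound_rate(s):.3f}, "
            f"sites to halve: {sites_to_halve_classical(s):.0e} "
            f"vs {sites_to_halve_sfs_bound(s):.0e}"
        )
```

The point of the last two functions is that halving a 1/sqrt(s) error takes four times as many sites, whereas halving a 1/log s bound takes the square of the number of sites, which quickly exceeds the number of independent sites any genome could offer.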

Baharian, Soheil, and Simon Gravel. **On the decidability of population size histories from finite allele frequency spectra**. Theoretical Population Biology (2018).

https://www.sciencedirect.com/science/article/pii/S004058091730148X

Abstract: Understanding the historical events that shaped current genomic diversity has applications in historical, biological, and medical research. However, the amount of historical information that can be inferred from genetic data is finite, which leads to an identifiability problem. For example, different historical processes can lead to identical distribution of allele frequencies. This identifiability issue casts a shadow of uncertainty over the results of any study which uses the frequency spectrum to infer past demography. It has been argued that imposing mild ‘reasonableness’ constraints on demographic histories can enable unique reconstruction, at least in an idealized setting where the length of the genome is nearly infinite. Here, we discuss this problem for finite sample size and genome length. Using the diffusion approximation, we obtain bounds on likelihood differences between similar demographic histories, and use them to construct pairs of very different reasonable histories that produce almost-identical frequency distributions. The finite-genome problem therefore remains poorly determined even among reasonable histories. Where fits to few-parameter models produce narrow parameter confidence intervals, large uncertainties lurk hidden by model assumption.

So I think I should add these to the criticism I made earlier of this approach to @glipsnort here:

In addition, I came across this paper, which @DennisVenema may find interesting as he writes his blog about the PSMC method:

Kim, J., Mossel, E., Rácz, M. Z., & Ross, N. (2015). **Can one hear the shape of a population history?**. Theoretical population biology, 100, 26-38.

http://www.sciencedirect.com/science/article/pii/S0040580914000987?via%3Dihub

I have also been reading up more on ARGweaver and intend to post again on this soon, @Swamidass.