Inferring population mutation rate and sequencing error rate using the SNP frequency spectrum in a sample of DNA sequences.

Title: Inferring population mutation rate and sequencing error rate using the SNP frequency spectrum in a sample of DNA sequences.
Publication Type: Journal Article
Year of Publication: 2009
Authors: Liu X, Maxwell TJ, Boerwinkle E, Fu Y-X
Journal: Mol Biol Evol
Volume: 26
Issue: 7
Pagination: 1479-90
Date Published: 2009 Jul
ISSN: 1537-1719
Keywords: Base Sequence; Computer Simulation; Humans; Mutation; Polymorphism, Single Nucleotide; Sequence Analysis, DNA
Abstract

One challenge of analyzing samples of DNA sequences is to account for the nonnegligible polymorphisms produced by error when the sequencing error rate is high or the sample size is large. Specifically, those artificial sequence variations will bias the observed single nucleotide polymorphism (SNP) frequency spectrum, which in turn may further bias the estimators of the population mutation rate θ = 4Nμ for diploids. In this paper, we propose a new approach based on the generalized least squares (GLS) method to estimate θ, given a SNP frequency spectrum in a random sample of DNA sequences from a population. With this approach, error rate ε can be either known or unknown. In the latter case, ε can be estimated given an estimation of θ. Using coalescent simulation, we compared our estimators with other estimators of θ. The results showed that the GLS estimators are more efficient than other θ estimators with error, and the estimation of ε is usable in practice when the θ per bp is small. We demonstrate the application of the estimators with 10-kb noncoding region sequence sampled from a human population and provide suggestions for choosing θ estimators with error.
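
To illustrate why sequencing error biases θ estimates based on the SNP frequency spectrum, the sketch below compares two simple textbook estimators. It is not the GLS method described in the abstract. It assumes an unfolded spectrum under the standard neutral model, where the expected number of sites with i derived copies is θ/i, and it assumes that sequencing errors mostly appear as extra singletons. The function names and the example counts are hypothetical.

```python
def harmonic(n):
    """Return a_n = sum_{i=1}^{n-1} 1/i for a sample of n sequences."""
    return sum(1.0 / i for i in range(1, n))


def theta_watterson(sfs):
    """Watterson's estimator: total segregating sites divided by a_n.

    sfs[i-1] is the number of sites where the derived allele is
    seen i times in the sample, for i = 1 .. n-1 (unfolded spectrum).
    """
    n = len(sfs) + 1
    return sum(sfs) / harmonic(n)


def theta_no_singletons(sfs):
    """Variant that ignores singletons, the class assumed here to be
    most inflated by sequencing error. Since E[singletons] = theta,
    the remaining sites have expectation theta * (a_n - 1)."""
    n = len(sfs) + 1
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return sum(sfs[1:]) / (harmonic(n) - 1.0)


# Hypothetical example: n = 5, theta = 12, so expected counts are 12/i.
clean = [12, 6, 4, 3]
# Same spectrum with 5 extra singletons standing in for sequencing errors.
noisy = [17, 6, 4, 3]

print(theta_watterson(clean), theta_no_singletons(clean))
print(theta_watterson(noisy), theta_no_singletons(noisy))
```

In this toy case the extra singletons raise Watterson's estimate, while the singleton-free variant is unchanged. Discarding singletons also discards real data, which is one motivation for methods that model the error rate explicitly.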

DOI: 10.1093/molbev/msp059
Alternate Journal: Mol. Biol. Evol.
PubMed ID: 19318520
PubMed Central ID: PMC2734145
Grant List: P50 GM065509 / GM / NIGMS NIH HHS / United States
5P50 GM 065509-07 / GM / NIGMS NIH HHS / United States