Statements in which the resource exists as a subject.
Predicate Object
rdf:type
lifeskim:mentions
pubmed:issue
9
pubmed:dateCreated
2009-9-2
pubmed:abstractText
New high-throughput sequencing technologies are generating large amounts of sequence data, allowing the development of targeted large-scale resequencing studies. For these studies, accurate identification of polymorphic sites is crucial. Heterozygous sites are particularly difficult to identify, especially in regions of low coverage. We present a new strategy for identifying heterozygous sites in a single individual by using a machine learning approach that generates a heterozygosity score for each chromosomal position. Our approach also facilitates the identification of regions with unequal representation of two alleles and other poorly sequenced regions. The availability of confidence scores allows for a principled combination of sequencing results from multiple samples. We evaluate our method on a gold standard data genotype set from HapMap. We are able to classify sites in this data set as heterozygous or homozygous with 98.5% accuracy. In de novo data our probabilistic heterozygote detection ("ProbHD") is able to identify 93% of heterozygous sites at a <5% false call rate (FCR) as estimated based on independent genotyping results. In direct comparison of ProbHD with high-coverage 1000 Genomes sequencing available for a subset of our data, we observe >99.9% overall agreement for genotype calls and close to 90% agreement for heterozygote calls. Overall, our data indicate that high-throughput resequencing of human genomic regions requires careful attention to systematic biases in sample preparation as well as sequence contexts, and that their impact can be alleviated by machine learning-based sequence analyses allowing more accurate extraction of true DNA variants.
pubmed:commentsCorrections
http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-16301213, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-17327845, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-17676041, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-17803354, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-17934467, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-17943122, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18183021, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18212088, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18421352, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18576944, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18704501, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18714091, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18775913, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18971308, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-18988837, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-19088343, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-19182786, http://linkedlifedata.com/resource/pubmed/commentcorrection/19605794-5420325
pubmed:language
eng
pubmed:journal
pubmed:citationSubset
IM
pubmed:status
MEDLINE
pubmed:month
Sep
pubmed:issn
1549-5469
pubmed:author
pubmed:issnType
Electronic
pubmed:volume
19
pubmed:owner
NLM
pubmed:authorsComplete
Y
pubmed:pagination
1542-52
pubmed:dateRevised
2010-9-24
pubmed:meshHeading
pubmed:year
2009
pubmed:articleTitle
A probabilistic approach for SNP discovery in high-throughput human resequencing data.
pubmed:affiliation
McGill Centre for Bioinformatics, McGill University, Montréal H3G 0B1, Canada
pubmed:publicationType
Journal Article; Research Support, Non-U.S. Gov't; Evaluation Studies