Predicate | Object |
---|---|
rdf:type | |
lifeskim:mentions | |
pubmed:issue | 1 |
pubmed:dateCreated | 1997-6-23 |
pubmed:abstractText | Several computer algorithms now exist for discovering multiple motifs (expressed as weight matrices) that characterize a family of protein sequences known to be homologous. This paper describes a method for performing similarity searches of protein sequence databases using such a group of motifs. By simultaneously using all the motifs that characterize a protein family, the sensitivity and specificity of the database search are increased. We define the p-value for a target sequence to be the probability of a random sequence of the same length scoring as well or better in comparison to all the motifs that characterize the family. (The p-value of a database search can be determined from this value and the size of the database.) We show that estimating the distribution of single motif scores by a Gaussian extreme value distribution is insufficiently accurate to provide a useful estimate of the p-value, but that this deficiency can be corrected by reestimating the parameters of the underlying Gaussian distribution from observed scores for comparison of a given motif and sequence database. These parameters are used to calculate a "reduced variate" which has a Gumbel limiting distribution. Multiple motif scores are combined to give a single p-value by using the sum of the reduced variates for the motif scores as the test statistic. We give a computationally efficient approximation to the distribution of the sum of independent Gumbel random variables and verify experimentally that it closely approximates the distribution of the test statistic. Experiments on pseudorandom sequences show that the approximated p-values are conservative, so the significance of high scores in database searches will not be overstated. Experiments with real protein sequences and motifs identified by the MEME algorithm show that determining an overall p-value based on the combination of multiple motifs gives significantly better database search results than using p-values of single motifs. |
pubmed:grant | |
pubmed:language | eng |
pubmed:journal | |
pubmed:citationSubset | IM |
pubmed:chemical | |
pubmed:status | MEDLINE |
pubmed:issn | 1066-5277 |
pubmed:author | |
pubmed:issnType | Print |
pubmed:volume | 4 |
pubmed:owner | NLM |
pubmed:authorsComplete | Y |
pubmed:pagination | 45-59 |
pubmed:dateRevised | 2007-11-14 |
pubmed:meshHeading | |
pubmed:year | 1997 |
pubmed:articleTitle | Score distributions for simultaneous matching to multiple motifs. |
pubmed:affiliation | San Diego Supercomputer Center, California 92186-9784, USA. tbailey@sdsc.edu |
pubmed:publicationType | Journal Article, Research Support, U.S. Gov't, P.H.S., Research Support, U.S. Gov't, Non-P.H.S. |
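
The abstract describes combining per-motif scores by summing reduced variates that each have a Gumbel limiting distribution, and then deriving a database-level p-value from the sequence-level one. As a rough illustration only, the Python sketch below estimates the tail probability of such a sum by Monte Carlo simulation; it is not the paper's analytic approximation, which the abstract does not spell out. The function names, the standard Gumbel parameterization (location 0, scale 1), and the independence assumption used to convert a sequence p-value into a database p-value are all assumptions made for this sketch.

```python
import math
import random


def gumbel_sample(rng):
    """Draw one standard Gumbel variate (location 0, scale 1) by inverse-CDF sampling."""
    u = rng.random()
    # random() can return exactly 0.0, where log(u) is undefined; redraw in that case.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def sum_tail_pvalue(observed_sum, n_motifs, trials=100_000, seed=0):
    """Monte Carlo estimate of P(sum of n_motifs independent Gumbel variates >= observed_sum).

    Hypothetical helper: a simulation stand-in for an analytic approximation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(gumbel_sample(rng) for _ in range(n_motifs))
        if total >= observed_sum:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


def database_pvalue(seq_pvalue, n_sequences):
    """Probability that at least one of n_sequences random sequences scores this well.

    Assumes independent sequences: 1 - (1 - p)^N, computed stably for small p.
    """
    if seq_pvalue >= 1.0:
        return 1.0
    return -math.expm1(n_sequences * math.log1p(-seq_pvalue))


if __name__ == "__main__":
    # Illustrative inputs only: three motifs with a summed reduced variate of 9.0.
    p_seq = sum_tail_pvalue(observed_sum=9.0, n_motifs=3)
    print("sequence p-value (estimate):", p_seq)
    print("database p-value for 50,000 sequences:", database_pvalue(p_seq, 50_000))
```

Simulation is slow and imprecise for the very small p-values that matter in database searches, which is presumably why the paper develops a computationally efficient approximation instead; the sketch is only meant to make the test statistic concrete.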