16450363

Source:http://linkedlifedata.com/resource/pubmed/id/16450363

Statements in which the resource exists as a subject.
Predicate	Object
rdf:type	pubmed:Citation
lifeskim:mentions	umls-concept:C0008902, umls-concept:C0025663, umls-concept:C0033684, umls-concept:C0205460, umls-concept:C0220825, umls-concept:C0681842, umls-concept:C1511726, umls-concept:C1516769, umls-concept:C1704675, umls-concept:C1880157
pubmed:issue	3
pubmed:dateCreated	2006-4-10
pubmed:abstractText	Protein-protein interactions play a key role in many biological systems. High-throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false-positive and false-negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co-complex relationship, and (3) pathway co-membership. To investigate systematically the utility of different data sources and the way the data is encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions, Random Forest (RF), RF similarity-based k-Nearest-Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co-complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top-ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast-2-hybrid system were not among the top-ranking features under any condition.
pubmed:grant	http://linkedlifedata.com/resource/pubmed/grant/NLM108730
pubmed:language	eng
pubmed:journal	http://linkedlifedata.com/resource/pubmed/journal/8700181
pubmed:citationSubset	IM
pubmed:status	MEDLINE
pubmed:month	May
pubmed:issn	1097-0134
pubmed:author	pubmed-author:Bar-JosephZivZ, pubmed-author:Klein-SeetharamanJudithJ, pubmed-author:QiYanjunY
pubmed:copyrightInfo	(c) 2006 Wiley-Liss, Inc.
pubmed:issnType	Electronic
pubmed:day	15
pubmed:volume	63
pubmed:owner	NLM
pubmed:authorsComplete	Y
pubmed:pagination	490-500
pubmed:dateRevised	2007-11-14
pubmed:meshHeading	pubmed-meshheading:16450363-Computational Biology, pubmed-meshheading:16450363-Databases, Protein, pubmed-meshheading:16450363-Forecasting, pubmed-meshheading:16450363-Protein Interaction Mapping
pubmed:year	2006
pubmed:articleTitle	Evaluation of different biological data and computational classification methods for use in protein interaction prediction.
pubmed:affiliation	School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA.
pubmed:publicationType	Journal Article, Comparative Study, Research Support, U.S. Gov't, P.H.S., Research Support, U.S. Gov't, Non-P.H.S., Research Support, Non-U.S. Gov't, Validation Studies
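
The abstract above mentions using the Gini index, the splitting function of the Random Forest trees, to estimate feature importance. The sketch below illustrates only the underlying idea and is not the authors' implementation: it scores each feature by the largest Gini impurity decrease that a single threshold split can achieve, which is a much simpler stand-in for averaging impurity decreases over a whole forest. The function names (`gini`, `best_gini_decrease`, `rank_features`), the feature names, and the toy data are all hypothetical and were invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_gini_decrease(values, labels):
    """Largest impurity decrease from one threshold split (x <= t) on a feature."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    # Skip the largest value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - weighted)
    return best


def rank_features(features, labels):
    """Return (name, score) pairs sorted from most to least important."""
    scores = {name: best_gini_decrease(vals, labels) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (invented): 1 = interacting pair, 0 = non-interacting pair.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "feature_a": [0.9, 0.8, 0.7, 0.2, 0.1, 0.3],  # separates the classes cleanly
    "feature_b": [1, 0, 0, 1, 0, 0],              # carries no class information
}

ranking = rank_features(features, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A follow-up step in the spirit of the abstract would retrain a classifier on only the top-ranked features and compare its accuracy with the full feature set; that step is omitted here because it depends on the classifier and data actually used.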