17604112

Source: http://linkedlifedata.com/resource/pubmed/id/17604112

Statements in which the resource exists as a subject.
Predicate	Object
rdf:type	pubmed:Citation
lifeskim:mentions	umls-concept:C0002045, umls-concept:C0008902, umls-concept:C0033684, umls-concept:C0525063, umls-concept:C0681935
pubmed:issue	6
pubmed:dateCreated	2008-4-21
pubmed:abstractText	Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
pubmed:language	eng
pubmed:journal	http://linkedlifedata.com/resource/pubmed/journal/7907378
pubmed:citationSubset	IM
pubmed:chemical	http://linkedlifedata.com/resource/pubmed/chemical/Proteins
pubmed:status	MEDLINE
pubmed:month	Apr
pubmed:issn	0165-022X
pubmed:author	pubmed-author:DhirSomduttaS, pubmed-author:Kertész-FarkasAttilaA, pubmed-author:KocsorAndrásA, pubmed-author:KuzniarArnoldA, pubmed-author:LeunissenJack A MJA, pubmed-author:NetoteiaSergiuS, pubmed-author:NijveenHarmH, pubmed-author:PacurarMirceaM, pubmed-author:PongorSándorS, pubmed-author:SonegoPaoloP
pubmed:issnType	Print
pubmed:day	24
pubmed:volume	70
pubmed:owner	NLM
pubmed:authorsComplete	Y
pubmed:pagination	1215-23
pubmed:meshHeading	pubmed-meshheading:17604112-Algorithms, pubmed-meshheading:17604112-Proteins, pubmed-meshheading:17604112-Sequence Analysis, Protein
pubmed:year	2008
pubmed:articleTitle	Benchmarking protein classification algorithms via supervised cross-validation.
pubmed:affiliation	Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary. kfa@inf.u-szeged.hu
pubmed:publicationType	Journal Article, Research Support, Non-U.S. Gov't
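
The supervised cross-validation idea described in the abstract (choosing test and train sets according to known subtypes rather than at random) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `supervised_cv_splits`, the `(group, subgroup)` label format, and in particular the rule for dividing negative examples (alternating whole subgroups in sorted order) are assumptions made here for the sake of a runnable example.

```python
from collections import defaultdict


def supervised_cv_splits(items, target_group):
    """Yield (held_out_subgroup, train_indices, test_indices) for one target group.

    items: list of (group, subgroup) labels, one per protein (hypothetical format).
    Positives are members of target_group; each round holds out one whole
    subgroup as the positive test set, so test positives are never close
    relatives of the training positives.
    """
    pos_by_subgroup = defaultdict(list)
    neg_by_subgroup = defaultdict(list)
    for idx, (group, subgroup) in enumerate(items):
        if group == target_group:
            pos_by_subgroup[subgroup].append(idx)
        else:
            neg_by_subgroup[(group, subgroup)].append(idx)

    # Assumption: negatives are also split by whole subgroups, alternating
    # between train and test in sorted order, so no subgroup straddles the split.
    neg_train, neg_test = [], []
    for i, key in enumerate(sorted(neg_by_subgroup)):
        (neg_train if i % 2 == 0 else neg_test).extend(neg_by_subgroup[key])

    for held_out in sorted(pos_by_subgroup):
        pos_test = pos_by_subgroup[held_out]
        pos_train = [
            i
            for subgroup, indices in sorted(pos_by_subgroup.items())
            if subgroup != held_out
            for i in indices
        ]
        if not pos_train:
            continue  # a group with a single subgroup cannot be split this way
        yield held_out, sorted(pos_train + neg_train), sorted(pos_test + neg_test)


# Toy labels: group "A" has two subgroups; "B" and "C" supply the negatives.
labels = [("A", "a1"), ("A", "a1"), ("A", "a2"),
          ("B", "b1"), ("B", "b2"), ("C", "c1")]
splits = list(supervised_cv_splits(labels, "A"))
```

Because an entire subgroup is withheld in each round, a classifier is scored on members it has not seen close relatives of, which is why such splits would be expected to give lower performance estimates than random k-fold partitions.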