Statements in which the resource exists as a subject.
Predicate Object
rdf:type
lifeskim:mentions
pubmed:issue
6
pubmed:dateCreated
2008-4-21
pubmed:abstractText
Identification of problematic protein classes (domain types, protein families) that are difficult to predict from sequence is a key issue in genome annotation. ROC (Receiver Operating Characteristic) analysis is routinely used for the evaluation of protein similarities; however, its results, the area under curve (AUC) values, are differentially biased for the various protein classes, which differ greatly in size. We show that the bias can be compensated for by adjusting the length of the top list in a class-dependent fashion, so that the number of negatives within the top list is equal to (or proportional to) the size of the positive class. Using this balanced protocol, the problematic classes can be identified by their AUC values, or by a scatter diagram in which the AUC values are plotted against the positive/negative ratio of the top list. With likelihood-ratio scoring (Kaján et al., Bioinformatics, 22, 2865-2869, 2007), the bias caused by class imbalance can be further decreased.
pubmed:language
eng
pubmed:journal
pubmed:citationSubset
IM
pubmed:chemical
pubmed:status
MEDLINE
pubmed:month
Apr
pubmed:issn
0165-022X
pubmed:author
pubmed:issnType
Print
pubmed:day
24
pubmed:volume
70
pubmed:owner
NLM
pubmed:authorsComplete
Y
pubmed:pagination
1210-4
pubmed:meshHeading
pubmed:year
2008
pubmed:articleTitle
Balanced ROC analysis (BAROC) protocol for the evaluation of protein similarities.
pubmed:affiliation
Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary. busarobi@inf.u-szeged.hu
pubmed:publicationType
Journal Article, Research Support, Non-U.S. Gov't
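
The balanced protocol described in pubmed:abstractText can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes a truncated ROC area (in the style of ROC_n) in which the ranked list is cut once the number of negatives reaches `ratio` times the size of the positive class. The function name `balanced_roc_auc` and the `ratio` parameter are hypothetical.

```python
def balanced_roc_auc(scores, labels, ratio=1.0):
    """Truncated AUC for one protein class (illustrative sketch).

    scores: similarity scores, higher means more similar to the class.
    labels: truthy for members of the positive class, falsy otherwise.
    ratio:  assumed proportionality factor between the number of
            negatives kept in the top list and the positive class size.
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("the positive class is empty")

    # Class-dependent cutoff: keep as many negatives as there are
    # positives (ratio=1.0), or a multiple of that number.
    max_neg = max(1, int(round(ratio * n_pos)))

    # Rank by descending score; ties keep their input order (no tie handling).
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

    true_pos = 0   # positives seen so far
    neg_seen = 0   # negatives seen so far
    area = 0       # sum of positives ranked above each kept negative
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            neg_seen += 1
            area += true_pos
            if neg_seen == max_neg:
                break

    if neg_seen == 0:
        return 1.0  # no negatives at all: ranking is trivially perfect
    # Normalise so that a perfect ranking yields 1.0.
    return area / (n_pos * neg_seen)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    labels = [1, 0, 1, 0, 0, 0]
    print(balanced_roc_auc(scores, labels))             # top list cut at 2 negatives
    print(balanced_roc_auc(scores, labels, ratio=2.0))  # top list cut at 4 negatives
```

With `ratio=1.0` the top list ends at the second negative for this toy class of two positives; with `ratio=2.0` all four negatives are used, so the value coincides with the ordinary AUC for these data.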