Statements in which the resource exists as a subject.
Predicate Object
rdf:type
lifeskim:mentions
pubmed:issue
2
pubmed:dateCreated
2009-12-16
pubmed:abstractText
In multivariate regression and classification problems, variable selection is an important procedure used to select an optimal subset of variables with the aim of producing more parsimonious and eventually more predictive models. Variable selection is often necessary when dealing with methodologies that produce thousands of variables, such as Quantitative Structure-Activity Relationships (QSARs) and high-dimensional analytical procedures. In this paper a novel method for variable selection for classification purposes is introduced. This method exploits the recently proposed Canonical Measure of Correlation between two sets of variables (CMC index). The CMC index is in this case calculated for two specific sets of variables, the former comprising the independent variables and the latter the unfolded class matrix. The CMC values, calculated by considering one variable at a time, can be sorted to give a ranking of the variables on the basis of their class discrimination capabilities. Alternatively, the CMC index can be calculated for all possible combinations of variables and the variable subset with the maximal CMC can be selected, but this procedure is computationally more demanding and the classification performance of the selected subset is not always the best. The effectiveness of the CMC index in selecting variables with discriminative ability was compared with that of other well-known strategies for variable selection, such as Wilks' Lambda, the VIP index based on Partial Least Squares-Discriminant Analysis, and the selection provided by classification trees. Finally, a variable Forward Selection based on the CMC index was used in conjunction with Linear Discriminant Analysis. This approach was tested on several chemical data sets. The results obtained were encouraging.
pubmed:language
eng
pubmed:journal
pubmed:status
PubMed-not-MEDLINE
pubmed:month
Jan
pubmed:issn
1873-4324
pubmed:author
pubmed:issnType
Electronic
pubmed:day
11
pubmed:volume
657
pubmed:owner
NLM
pubmed:authorsComplete
Y
pubmed:pagination
116-22
pubmed:year
2010
pubmed:articleTitle
Canonical Measure of Correlation (CMC) and Canonical Measure of Distance (CMD) between sets of data. Part 3. Variable selection in classification.
pubmed:affiliation
Milano Chemometrics and QSAR Research Group, Department of Environmental Sciences, University of Milano-Bicocca, Piazza della Scienza 1, I-20126 Milano, Italy. davide.ballabio@unimib.it
pubmed:publicationType
Journal Article
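The one-variable-at-a-time ranking described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the record does not give the CMC formula, so the score used here is a stand-in, namely the squared canonical correlation between a single variable and the unfolded (one-hot) class matrix, which for one variable equals the ratio of between-class to total sum of squares. The function names unfold_classes, univariate_score and rank_variables are hypothetical and do not come from the paper.

```python
import numpy as np


def unfold_classes(labels):
    """Build the unfolded class matrix: one 0/1 column per class."""
    classes, idx = np.unique(np.asarray(labels), return_inverse=True)
    return np.eye(len(classes))[idx]


def univariate_score(x, Y):
    """Stand-in score (NOT the published CMC index).

    Squared canonical correlation between one variable x and the
    class matrix Y, computed as between-class SS / total SS.
    """
    x = np.asarray(x, dtype=float)
    grand_mean = x.mean()
    total_ss = np.sum((x - grand_mean) ** 2)
    if total_ss == 0.0:
        return 0.0  # a constant variable cannot discriminate classes
    counts = Y.sum(axis=0)            # samples per class
    class_means = (Y.T @ x) / counts  # mean of x within each class
    between_ss = np.sum(counts * (class_means - grand_mean) ** 2)
    return between_ss / total_ss


def rank_variables(X, labels):
    """Score each column of X on its own, then sort best-first."""
    X = np.asarray(X, dtype=float)
    Y = unfold_classes(labels)
    scores = np.array([univariate_score(X[:, j], Y) for j in range(X.shape[1])])
    order = np.argsort(-scores)       # indices of variables, highest score first
    return order, scores
```

Calling rank_variables(X, labels) on a samples-by-variables array returns the variable indices sorted by decreasing score together with the scores themselves. An exhaustive search over all variable subsets, as the abstract notes, would be far more expensive than this single pass.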