Statements in which the resource exists as a subject.
Predicate Object
rdf:type
lifeskim:mentions
pubmed:dateCreated
2003-12-10
pubmed:abstractText
The Knowledge Discovery in Databases (KDD) methodology seems attractive for the analysis of large clinical databases. In the KDD process, the preprocessing step (data cleaning and handling of missing values) is paramount, since it conditions the quality of the results obtained by data mining procedures and represents about 80% of the whole project time. The aims of the present study were to analyze this step and provide tools to handle inconsistent data and missing values. We have broken down the process into 3 main stages: data cleaning, explanatory study of missing values, and choice of the procedure used for handling missing values. The data cleaning stage was based on a system of logical rules to correct mistakes and on cluster analysis to discard the poorly filled files. The missing-data mechanism was analyzed by means of multivariate statistical procedures. Two methods to deal with missing values were compared: imputation by the most common value (mode) and imputation using decision trees. This study was performed on a large medical diabetes database (23,601 patients) including numerous missing values. A system of logical rules made it possible to correct mistakes in essential parameters (for example, the type of diabetes). Cluster analysis identified 10% of the files as poorly filled. After multivariate analysis, the missing-data mechanism could be considered random. For variables with a low number of missing values (< 10%) and few categories (< 4), imputation using decision trees provided better results than imputation by mode.
pubmed:language
eng
pubmed:journal
pubmed:citationSubset
T
pubmed:status
MEDLINE
pubmed:issn
0926-9630
pubmed:author
pubmed:issnType
Print
pubmed:volume
95
pubmed:owner
NLM
pubmed:authorsComplete
Y
pubmed:pagination
269-74
pubmed:dateRevised
2004-11-17
pubmed:meshHeading
pubmed:year
2003
pubmed:articleTitle
A preprocessing method for improving data mining techniques. Application to a large medical diabetes database.
pubmed:affiliation
CERIM, Faculté de Médecine, 1 Place de Verdun, 59045 Lille, France.
pubmed:publicationType
Journal Article
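
The two imputation strategies compared in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no code, the function names (mode_impute, stump_impute) and the toy records are hypothetical, and the second function uses a single split on one categorical predictor as a simplified stand-in for a full decision tree.

```python
from collections import Counter, defaultdict


def mode_impute(values):
    """Replace each None with the most common observed value (the mode)."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def stump_impute(target, predictor):
    """Replace each None in `target` with the most common target value among
    records that share the same `predictor` value.

    This is a depth-1 tree with one multiway split, a simplified stand-in
    for the decision-tree imputation named in the abstract. It falls back
    to the overall mode when a predictor group has no observed target.
    """
    overall = Counter(v for v in target if v is not None).most_common(1)[0][0]
    by_group = defaultdict(Counter)
    for t, p in zip(target, predictor):
        if t is not None:
            by_group[p][t] += 1

    result = []
    for t, p in zip(target, predictor):
        if t is not None:
            result.append(t)
        elif by_group[p]:
            result.append(by_group[p].most_common(1)[0][0])
        else:
            result.append(overall)
    return result


# Hypothetical toy records, invented for illustration only.
diabetes_type = ["type1", "type2", "type2", "type1", "type2", "type2", "type1"]
treatment = ["insulin", "oral", "oral", None, "oral", None, "insulin"]

by_mode = mode_impute(treatment)
by_stump = stump_impute(treatment, diabetes_type)
print(by_mode)
print(by_stump)
```

In this toy data, mode imputation fills both gaps with the overall most common treatment, while the split on diabetes type fills each gap with the most common treatment within its own group. The sketch only shows how the two approaches can differ; it says nothing about which performs better on real data.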