Implementation of Statistical Feature Selection and Feature Extraction on Cancer Classification

Authors

  • Muhammad Azharuddin Arif
  • Zuraini Ali Shah

Abstract

Cancer classification research increasingly relies on microarray technology, which allows the expression of thousands of genes to be measured simultaneously. Microarrays have been applied successfully to many problems, for example in medical science, and have shown the ability to diagnose patients who have a specific disease. The technology is therefore used to detect diseases such as cancer, which is usually treated as a binary classification problem. The major drawback for classification is that the gene expression data produced by microarrays are high-dimensional. To counter this problem, the important genes should be identified and the dimensionality of the microarray data reduced. In this research, six feature selection methods (Receiver Operating Characteristic curve, Wilcoxon rank sum test, t-statistic, Kruskal-Wallis test statistic, Fisher score, and Gini index) are each combined with Principal Component Analysis (feature extraction) to address the high-dimensionality problem and produce a new, reduced dataset from the original one. The reduced dataset is then classified according to class. Three classifiers (K-Nearest Neighbour, Linear Discriminant Analysis, and Support Vector Machine) are used, and the performance of each classifier is calculated and compared. The experimental results show that two combinations achieve the highest correct rate, 96%, outperforming the other feature selection methods: the Wilcoxon rank sum test with Principal Component Analysis for the Linear Discriminant Analysis classifier, and the Receiver Operating Characteristic curve with Principal Component Analysis for the Support Vector Machine classifier.
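The pipeline described in the abstract (filter-based gene selection, then Principal Component Analysis, then classification) can be illustrated with a minimal sketch. It is not the authors' implementation, and several things in it are assumptions made only for illustration: the data are synthetic rather than microarray measurements, the t-statistic stands in for all six filters, only the first principal component is kept (computed by power iteration), the classifier is a 3-nearest-neighbour vote, and all function names and parameter values are hypothetical.

```python
import math
import random

def t_statistic(a, b):
    """Welch two-sample t-statistic for one gene measured in two classes."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def select_top_genes(X, y, k):
    """Feature selection: indices of the k genes with the largest |t|."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(t_statistic(pos, neg)), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])

def first_principal_component(X, iters=200):
    """Feature extraction: column means and unit direction of the first PC."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Z = [[row[j] - mean[j] for j in range(d)] for row in X]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):  # power iteration on Z^T Z
        proj = [sum(z[j] * v[j] for j in range(d)) for z in Z]
        w = [sum(Z[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return mean, v

def project(X, mean, v):
    """Score of each sample along the principal direction v."""
    return [sum((row[j] - mean[j]) * v[j] for j in range(len(v))) for row in X]

def knn_predict(train_z, train_y, z, k=3):
    """Classification: majority vote among the k nearest training scores."""
    nearest = sorted(zip(train_z, train_y), key=lambda p: abs(p[0] - z))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

# Synthetic stand-in for expression data (an assumption, not real data):
# 200 genes, of which three are shifted upward in class 1.
rng = random.Random(0)
n_genes, informative = 200, [3, 50, 120]

def make_samples(n_per_class):
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for j in informative:
                    row[j] += 4.0
            X.append(row)
            y.append(label)
    return X, y

X_train, y_train = make_samples(20)
X_test, y_test = make_samples(10)

# Select genes and fit the projection on the training data only.
selected = select_top_genes(X_train, y_train, k=3)
train_sel = [[row[j] for j in selected] for row in X_train]
test_sel = [[row[j] for j in selected] for row in X_test]
mean, pc1 = first_principal_component(train_sel)
train_z = project(train_sel, mean, pc1)
test_z = project(test_sel, mean, pc1)

predictions = [knn_predict(train_z, y_train, z) for z in test_z]
correct_rate = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print("selected genes:", selected)
print("correct rate:", correct_rate)
```

Gene selection and the principal direction are fitted on the training samples only and then applied to the test samples, so the reported correct rate is not inflated by information from the test set. Swapping in another filter, such as the Wilcoxon rank sum statistic or the Fisher score, would only require replacing `t_statistic`.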

Section

Articles