Research on Missing Data Imputation Methods for Gene Expression
Abstract
Microarray technologies allow the expression levels of thousands of genes to be monitored under a variety of conditions.
Gene expression data are mostly accurate but still contain errors, because the microarray data obtained often have many
missing values. The result of a microarray experiment is a large matrix of expression levels, with genes as rows and
experimental conditions as columns, in which some values are frequently missing. The presence of missing values can affect the
results of visualization and analysis of gene expression. This creates a need for machine learning methods that address the
missing value problem by imputing values into the microarray data. Imputation methods replace missing values with
estimates based on information derived from the data set itself. In this research, K-nearest Neighbour, Local Least Square,
Bayesian Principal Component Analysis, mean, and median imputation methods are used for missing value imputation. The
performance of each imputation method is evaluated using two different types of classifiers: a support vector machine (SVM)
and an artificial neural network (ANN). The analysis shows that K-nearest Neighbour imputation gives the highest accuracy
with SVM, at 0.9146, while Local Least Square imputation gives the best result with ANN, at 0.8445.
Overall, SVM achieves better accuracy than ANN after imputation.
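
To make the idea of imputation concrete, the following is a minimal sketch in Python with NumPy of three of the simpler methods named above: mean, median, and K-nearest Neighbour imputation on a genes-by-conditions matrix. It is an illustration under stated assumptions, not the implementation used in this research. The function names `impute_row_stat` and `impute_knn`, the toy matrix `expr`, and the choice to take an unweighted average over fully observed neighbour genes are all hypothetical simplifications.

```python
import numpy as np


def impute_row_stat(X, stat="mean"):
    """Fill each missing entry with the mean or median of the
    observed values in the same row (gene)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    if stat == "mean":
        fill = np.nanmean(X, axis=1)
    else:
        fill = np.nanmedian(X, axis=1)
    rows, cols = np.where(np.isnan(X))
    out[rows, cols] = fill[rows]
    return out


def impute_knn(X, k=2):
    """Fill each missing entry with the average value, in that column,
    of the k most similar genes.

    Assumption: only genes with no missing values are used as
    neighbours, and the average is unweighted.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    has_nan = np.isnan(X).any(axis=1)
    complete = np.where(~has_nan)[0]
    for i in np.where(has_nan)[0]:
        miss = np.isnan(X[i])
        # Euclidean distance computed only over the observed columns of gene i.
        diff = X[complete][:, ~miss] - X[i, ~miss]
        dist = np.sqrt((diff ** 2).sum(axis=1))
        nearest = complete[np.argsort(dist)[:k]]
        out[i, miss] = X[nearest][:, miss].mean(axis=0)
    return out


# Toy expression matrix: rows are genes, columns are conditions.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [5.0, 6.0, 7.0, 8.0],
    [1.0, 2.0, np.nan, 4.0],
])

mean_filled = impute_row_stat(expr, "mean")
median_filled = impute_row_stat(expr, "median")
knn_filled = impute_knn(expr, k=2)
```

In this toy case the row mean and median ignore the other genes entirely, whereas the K-nearest Neighbour estimate borrows the missing condition from the two genes whose observed profiles are closest, which is why neighbour-based methods can recover values that a per-gene average cannot.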