Academia of Data Science

Experimental Result of Feature Selections with PCA on KNN, LDA and SVM Classification

2020-09-07T04:38:32+00:00

Cancer classification research now makes use of advanced technologies such as microarrays. A microarray allows thousands of genes to be measured simultaneously, and the technology has been applied successfully to many problems, for example in medical science. Microarrays have also shown the ability to diagnose patients who have a specific disease, so they are used to detect diseases such as cancer, which is usually treated as a binary-class problem. The major drawback for classification is that the gene expression data produced by microarrays is high-dimensional. To counter this problem, the important genes should be identified and the dimensionality of the microarray data reduced. In this research, six feature selection methods (Receiver Operating Characteristic curve, Wilcoxon rank sum test, t-statistic, Kruskal-Wallis test statistic, Fisher score, and Gini index) are combined with Principal Component Analysis (feature extraction) to address the high-dimensionality problem and produce a reduced subset of the original datasets. The reduced dataset is then classified. Three classifiers (K-Nearest Neighbour, Linear Discriminant Analysis, and Support Vector Machine) are used, and the performance of each is calculated and compared. The experimental results show that two combinations achieve the highest correct rate, 96%, outperforming the other feature selection methods: the Wilcoxon rank sum test with Principal Component Analysis for the Linear Discriminant Analysis classifier, and the Receiver Operating Characteristic curve with Principal Component Analysis for the Support Vector Machine classifier.
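
As an illustration of the feature selection step, the sketch below ranks features by Fisher score, one of the six criteria named in the abstract, and keeps the top k. It is a minimal sketch, not the authors' implementation: the toy matrix, the function names `fisher_score` and `top_k_features`, and the choice of population variance are all assumptions made for the example. PCA and the classifiers are left out.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (mean(group) - overall) ** 2
        within += len(group) * pvariance(group)
    # A feature with zero within-class scatter separates the classes perfectly
    return between / within if within > 0 else float("inf")


def top_k_features(rows, labels, k):
    """Return the column indices of the k features with the highest Fisher score."""
    n_features = len(rows[0])
    scores = [
        fisher_score([row[j] for row in rows], labels) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: 6 samples x 3 "genes", binary class labels
rows = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 2.5],
    [0.9, 4.0, 1.5],
    [3.0, 4.5, 3.0],
    [3.1, 3.5, 2.8],
    [2.9, 5.5, 3.5],
]
labels = [0, 0, 0, 1, 1, 1]

selected = top_k_features(rows, labels, k=2)
print(selected)
```

In a pipeline like the one the abstract describes, the selected columns would then be passed to PCA and the resulting components to a classifier.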

Clustering Techniques used for Gene Cancers

2020-09-07T04:09:31+00:00

Melanoma is the deadliest skin cancer. It can develop on any part of the human body. The disease can be cured if it is diagnosed early and proper treatment is given. A problem in cancer classification is handling large cancer datasets, which contain meaningless and redundant data. To overcome this problem, many computational approaches to classification have been proposed in the literature. In this work, the clustering process for melanoma is conducted using Gaussian Mixture Models, KMeans_rcpp, K-Means and Support Vector Machine. The purpose of this research is to evaluate how accurately these algorithms identify genes associated with melanoma skin cancer.
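
The abstract does not say how clustering accuracy was measured. One common option when true labels are available is purity: each cluster is credited with its majority label, and the credited samples are divided by the total. The sketch below assumes that measure; the cluster assignments and labels are hypothetical and do not come from the study.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cluster, []).append(label)
    # Count, per cluster, how many members share the most frequent label
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(true_labels)


# Hypothetical output of a clustering run on 8 samples
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = [
    "melanoma", "melanoma", "melanoma", "normal",
    "normal", "normal", "melanoma", "normal",
]

score = purity(cluster_ids, true_labels)
print(score)
```

Purity does not depend on how the cluster ids are numbered, which makes it convenient for comparing the output of different clustering algorithms on the same labelled data.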

Keyword Extraction of Biomedical Literature Using Text Mining

2020-09-07T02:45:47+00:00

Textual information is easy for humans to understand because it is presented in words and characters. Text mining has emerged as a technology for extracting this kind of information: it is the process of extracting non-trivial patterns or knowledge from text documents or textual databases. The purpose of this paper is to perform and compare keyword extraction using statistical and linguistic extraction tools on 120 text documents related to hypertension and diabetes. For this comparison, RStudio and Fivefilters, which are statistical-based tools, and TerMine and Flexiterm, which are linguistic-based tools, are used to extract the specified keywords from the biomedical literature. Classification with a K-Nearest Neighbour classifier is then carried out to evaluate and compare the performance of the statistical and linguistic approaches. The experimental results show the differences between the two groups of tools in keyword extraction.
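
To make the idea of statistical keyword extraction concrete, the sketch below scores terms by TF-IDF and returns the top terms per document. This is an assumption for illustration only: the abstract does not state which statistic the tools use, and the three short documents are invented. Stop-word removal is deliberately omitted to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents, top_n=3):
    """Return the top_n terms of each document ranked by TF-IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


# Hypothetical mini-corpus
docs = [
    "hypertension raises blood pressure and blood pressure strains the heart",
    "diabetes raises blood glucose and insulin lowers blood glucose",
    "exercise helps the heart and helps control glucose",
]

keywords = tfidf_keywords(docs, top_n=2)
print(keywords)
```

A term that occurs in every document, such as "and" here, receives a score of zero. A repeated but uninformative word can still rank highly in a small corpus, which is one reason practical pipelines add stop-word filtering.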

Experimental Result Using K-Means Clustering In Incorporation Local Protein Structure Information

2020-09-07T04:25:15+00:00

This paper gives an overview of protein structural class prediction using a classification method to boost prediction accuracy, and introduces an effective computational method for predicting the structural classes of proteins precisely. It investigates the importance of identifying the optimal number of clusters when predicting structural classes with the K-Means clustering algorithm, and analyzes the impact of using a fixed length to obtain the protein secondary structure on the prediction of structural classes. Several limitations and contributions are highlighted, and the results and achievements are summarized. Lastly, the paper draws a general conclusion based on the methods carried out in this work and gives several recommendations on prospective areas for future work.
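
The abstract does not describe how the optimal number of clusters was identified. A widely used heuristic is the elbow method: run K-Means for several values of k and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply. The sketch below assumes that heuristic, uses invented 2-D points, and initialises centroids from the first k points so the run is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns the final centroids and the WCSS."""
    # Assumption for the sketch: the first k points serve as initial centroids
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss


# Hypothetical data: three well-separated groups of three points each
points = [
    (0, 0), (5, 5), (10, 0),
    (0, 1), (5, 6), (10, 1),
    (1, 0), (6, 5), (11, 0),
]

wcss_by_k = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 3))
```

On this toy data the WCSS falls steeply up to k = 3 and only slightly afterwards, which is the pattern the elbow heuristic looks for. Real data rarely shows such a clean bend, and random initialisation with several restarts is the more usual choice.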

The Comparison of Algorithm to Determine the Treatment of the Diseases

2020-09-07T03:58:43+00:00

Text mining is a procedure for acquiring previously obscure information from the data we have received. It can combine information from numerous sources. Two text mining techniques are used in this work: information extraction and classification. For information extraction, RStudio is used, with tm_map as the medium for extracting the term keywords. For classification, Support Vector Machine (SVM) is used, with the Waikato Environment for Knowledge Analysis (WEKA) as the tool for running SVM to classify the treatment. WEKA offers many algorithms, which give different results. The purpose of this research is to identify whether medical drugs or natural remedies are the best treatment.
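
In R's tm package, tm_map applies a transformation function to every document in a corpus. The sketch below shows the same idea in Python as a rough analogue, not a port of the authors' R code: the helper `apply_transforms`, the four cleaning functions, the tiny stop-word list and the two sentences are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list, not a standard one
STOP_WORDS = {"the", "a", "is", "for", "and", "of", "to", "with"}


def to_lower(text):
    return text.lower()


def remove_punctuation(text):
    return re.sub(r"[^\w\s]", " ", text)


def remove_numbers(text):
    return re.sub(r"\d+", " ", text)


def remove_stop_words(text):
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def apply_transforms(corpus, transforms):
    """Apply each transformation, in order, to every document in the corpus."""
    for transform in transforms:
        corpus = [transform(doc) for doc in corpus]
    return corpus


# Hypothetical two-document corpus
corpus = [
    "The drug is prescribed for hypertension, 2 tablets daily.",
    "A herbal remedy is used for hypertension and the diet is adjusted!",
]

cleaned = apply_transforms(
    corpus, [to_lower, remove_punctuation, remove_numbers, remove_stop_words]
)
term_counts = Counter(word for doc in cleaned for word in doc.split())
print(cleaned)
print(term_counts.most_common(3))
```

The resulting term counts are the kind of input from which a document-term matrix can be built and then passed to a classifier such as SVM.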