KNN Classification of Diabetes Documents
Abstract
Information extraction is a text mining technique that extracts information or knowledge from unstructured documents. This paper compares the performance of two information extraction approaches, statistical and linguistic, in classifying diabetes documents. Both approaches extract information related to the risk factors of diabetes from the biomedical literature. Much research has focused on extracting information from natural language text documents, but little has focused on biomedical literature; comparing the two approaches helps to improve the performance of information extraction from biomedical documents. Two different tools are used to extract terms from the abstracts of the related journal articles: fivefilters for the statistical approach and Flexiterm for the linguistic approach. The dataset contains only the titles and abstracts of diabetes-related journal articles retrieved from PubMed, 104 documents in total. To measure the quality of the terms extracted by each approach, text classification is performed with a K-Nearest Neighbors (KNN) classifier. The dataset is split into 70% for training (73 documents) and 30% for testing (31 documents). The average classification accuracy is 80.65% for the statistical approach and 85.71% for the linguistic approach. These results indicate that the linguistic approach is better than the statistical approach at extracting information from biomedical documents.
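The evaluation setup described above (a 70/30 split followed by KNN classification over extracted terms) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in this work: the cosine similarity over term counts, the value k = 3, the random seed, the class labels, and the toy documents are all assumptions made for the example, since the abstract does not specify them.

```python
import math
import random
from collections import Counter


def split_70_30(items, seed=0):
    """Shuffle items and split them into 70% training and 30% testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, for repeatability
    n_train = round(0.7 * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, terms, k=3):
    """Return the majority label among the k most similar training documents.

    `train` is a list of (terms, label) pairs; k=3 is an assumed value.
    """
    query = Counter(terms)
    ranked = sorted(train, key=lambda doc: cosine(query, Counter(doc[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test documents whose predicted label matches the true label."""
    correct = sum(knn_predict(train, terms, k) == label for terms, label in test)
    return correct / len(test)


# Hypothetical toy documents: each is a list of extracted terms plus an assumed label.
TOY_TRAIN = [
    (["obesity", "bmi", "risk"], "risk_factor"),
    (["obesity", "diet", "risk"], "risk_factor"),
    (["smoking", "risk", "hypertension"], "risk_factor"),
    (["insulin", "therapy", "dose"], "other"),
    (["insulin", "pump", "therapy"], "other"),
    (["metformin", "therapy", "trial"], "other"),
]

TOY_TEST = [
    (["obesity", "risk", "age"], "risk_factor"),
    (["insulin", "therapy"], "other"),
]

if __name__ == "__main__":
    train_part, test_part = split_70_30(range(104))
    print(len(train_part), len(test_part))
    print(accuracy(TOY_TRAIN, TOY_TEST))
```

With 104 items, the split yields 73 training and 31 testing documents, matching the counts reported above. In this sketch the terms produced by each extraction approach would take the place of the toy term lists, so that the two approaches can be compared by their resulting classification accuracy.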