Comparison between Statistical and Linguistic Approaches in Classifying Biomedical Literature


  • Nur Aniq Syafiq Rodzuan
  • Shahreen Kasim
  • Mohanavali Sithambranathan
  • Muhammad Zaki Hassan
  • Ahmed Hussein Ali
  • Muhaini Othman


Textual information is clear because it is presented in words and characters, which are easy for humans to understand. Text mining was introduced as a technology for extracting this kind of information: it is the process of extracting non-trivial patterns or knowledge from text documents or textual databases. The purpose of this paper is to perform and compare keyword extraction using statistical and linguistic extraction tools on 120 text documents related to hypertension, diabetes and stroke. To draw this comparison, the statistical-based tools RStudio and Fivefilters and the linguistic-based tools TerMine and Flexiterm are used to extract keywords from the biomedical literature. Classification with Naïve Bayes and K-Nearest Neighbor classifiers is then carried out to evaluate and compare the performance of the statistical and linguistic approaches using these tools. The experimental results show how the two approaches compare and where the tools differ in keyword extraction.
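As a rough illustration of the pipeline described in the abstract, the sketch below extracts keywords with a statistical score (TF-IDF) and then classifies a query with a small Naïve Bayes model and a K-Nearest Neighbor vote. It is only a minimal sketch under stated assumptions: the six-sentence corpus, the stopword list, the function names and the choice of TF-IDF and cosine similarity are all invented for this example, and it does not reproduce RStudio, Fivefilters, TerMine, Flexiterm or the paper's experimental setup.

```python
import math
from collections import Counter

# Toy corpus invented for illustration; it is NOT the paper's 120 documents.
DOCS = [
    ("hypertension", "high blood pressure raises hypertension risk blood pressure control"),
    ("hypertension", "hypertension treatment lowers blood pressure in adults"),
    ("diabetes", "insulin resistance and glucose levels in diabetes patients"),
    ("diabetes", "diabetes management tracks glucose and insulin therapy"),
    ("stroke", "ischemic stroke follows a brain artery clot"),
    ("stroke", "stroke rehabilitation restores brain function after a clot"),
]
STOPWORDS = {"and", "in", "a", "after"}  # assumed, deliberately tiny


def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def tfidf_keywords(texts, top_k=5):
    """Return the top_k terms per document, ranked by TF-IDF (ties alphabetical)."""
    token_lists = [tokenize(t) for t in texts]
    doc_freq = Counter(term for toks in token_lists for term in set(toks))
    n_docs = len(token_lists)
    result = []
    for toks in token_lists:
        tf = Counter(toks)
        scores = {t: (c / len(toks)) * math.log(n_docs / doc_freq[t])
                  for t, c in tf.items()}
        result.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])
    return result


def train_nb(samples):
    """Collect class and term counts for a multinomial Naive Bayes model."""
    class_docs = Counter(label for label, _ in samples)
    term_counts = {label: Counter() for label in class_docs}
    for label, toks in samples:
        term_counts[label].update(toks)
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_docs, term_counts, vocab


def predict_nb(model, tokens):
    """Pick the class with the highest log-posterior (Laplace smoothing)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_lp = None, float("-inf")
    for label in sorted(class_docs):
        lp = math.log(class_docs[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                lp += math.log((term_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_knn(samples, tokens, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = Counter(tokens)
    ranked = sorted(samples, key=lambda s: -cosine(Counter(s[1]), query))
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


labels = [label for label, _ in DOCS]
keywords = tfidf_keywords([text for _, text in DOCS])
samples = list(zip(labels, keywords))  # extracted keywords act as features

query = tokenize("elevated glucose and insulin")
nb_label = predict_nb(train_nb(samples), query)
knn_label = predict_knn(samples, query)
print(nb_label, knn_label)
```

In this sketch the extracted keywords are the only features the classifiers see, so a weaker extraction step feeds directly into lower classification accuracy. That dependency is what makes classifier performance usable as a proxy for comparing extraction approaches.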