International Journal on Advanced Science, Engineering and Information Technology, Vol. 8 (2018) No. 5, pages: 2189-2195, DOI:10.18517/ijaseit.8.5.6432

Comparative Analysis of Different Data Representations for the Task of Chemical Compound Extraction

Basel Alshaikhdeeb, Kamsuriah Ahmad


Chemical compound extraction is the task of recognizing mentions of chemical entities, such as oxygen and nitrogen, in text. Most studies that address this task use machine-learning techniques, and the key challenge in doing so lies in employing a robust set of features. The literature reports numerous types of features for chemical compound extraction, and the dimensionality of the feature space is determined by the chosen data representation. Some researchers have used an N-gram representation for biomedical named entity recognition, in which the most significant terms are represented as features. Others have used a detailed-attribute representation, in which the features are generalized. As a result, identifying the combination of features that yields high classification accuracy is challenging. This paper applies the Wrapper Subset Selection approach to two data representations: N-gram and detailed-attributes. Because each data representation suits a specific classification algorithm, two classifiers were used: Naïve Bayes (for detailed-attributes) and Support Vector Machine (for N-gram). The results show that feature selection with the detailed-attribute representation outperformed feature selection with the N-gram representation, achieving an f-measure of 0.722. In addition to the higher classification accuracy, the features selected under the detailed-attribute representation are more meaningful and can be applied to other datasets.
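The contrast between the two representations, and the general idea of wrapper-based selection, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the use of word-level N-grams, the particular surface attributes, and the greedy forward search are all hypothetical. In a real wrapper, the `score` callback would be the cross-validated performance of the chosen classifier on the candidate feature subset.

```python
def word_ngrams(tokens, n=2):
    """Return the contiguous word n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def detailed_attributes(token):
    """Describe one token by generalized surface attributes (hypothetical set)."""
    return {
        "has_digit": any(ch.isdigit() for ch in token),
        "has_upper": any(ch.isupper() for ch in token),
        "has_hyphen": "-" in token,
        "length": len(token),
        "suffix": token[-3:].lower(),
    }


def forward_wrapper_selection(features, score):
    """Greedy forward search: repeatedly add the feature that most improves
    score(subset), stopping when no remaining feature improves it."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


def toy_score(subset):
    """Stand-in for a classifier's cross-validated f-measure (made-up values)."""
    usefulness = {"has_digit": 0.3, "suffix": 0.2}
    return sum(usefulness.get(f, -0.05) for f in subset)


if __name__ == "__main__":
    print(word_ngrams(["sodium", "chloride", "solution"]))
    print(detailed_attributes("2-Propanol"))
    print(forward_wrapper_selection(["has_digit", "length", "suffix"], toy_score))
```

The N-gram features are tied to the specific terms seen in the training text, whereas the attribute dictionary describes any token in the same generalized terms, which is one way to read the abstract's remark that detailed-attribute features carry over to other datasets.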


Chemical Compounds Extraction; Data Representation; N-gram; Detailed-Attributes; Naïve Bayes; Support Vector Machine; Attribute Selection

