International Journal on Advanced Science, Engineering and Information Technology, Vol. 9 (2019) No. 4, pages: 1460-1465, DOI:10.18517/ijaseit.9.4.8894

Development of Rule-Based Feature Extraction in Multi-label Text Classification

Gugun Mediamer, Adiwijaya, Said Al Faraby

Abstract

Hadith is the second main guideline after the Holy Quran in the Islamic religion, revealed through the Messenger of Allah. Today, a Hadith can be classified into more than one class, such as advice, prohibition, and information, which helps readers filter the appropriate classes for each Hadith of Rasulullah SAW. Text classification studies involve many kinds of data, so handling that fits the characteristics of the particular data is required. This study investigates the handling of multi-label data (Hadith Bukhari in Indonesian translation), focusing on feature extraction, feature weighting, and preprocessing methods. It uses a rule-based feature extraction combined with several types of preprocessing and three feature-weighting methods: TF-IDF, Word2vec, and Word2vec weighted with TF-IDF. The five preprocessing stages in this research are Case Folding, Tokenization, Punctuation Removal, Stopword Removal, and Stemming. From the 13 experiments conducted on 2000 hadiths, the best multi-label classification performance was produced by the combination of the proposed rule-based feature extraction, the Word2vec feature-weighting method, and preprocessing without Stemming and Stopword Removal. The Hamming Loss obtained from this combination was 0.0623. The results show that the proposed rule-based feature extraction method performs better than the baseline method.
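
The abstract reports results as Hamming Loss, the usual multi-label metric: the fraction of label slots (one per hadith per class) that are predicted incorrectly, so lower is better. The following is a minimal sketch of that computation, not the authors' evaluation code. The label matrices are hypothetical toy data, and the column order (advice, prohibition, information) is an assumption made only for illustration.

```python
def hamming_loss(y_true, y_pred):
    """Return the fraction of label slots where prediction and truth differ."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each row must have the same number of labels")
        total += len(true_row)
        # Count every label position that disagrees with the ground truth.
        wrong += sum(1 for t, p in zip(true_row, pred_row) if t != p)
    if total == 0:
        raise ValueError("no labels to evaluate")
    return wrong / total


# Hypothetical toy data: 4 hadiths, 3 binary labels each
# (assumed column order: advice, prohibition, information).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

loss = hamming_loss(y_true, y_pred)
print(f"Hamming Loss: {loss:.4f}")
```

In this toy example 2 of the 12 label slots are wrong. A value such as the reported 0.0623 would mean that roughly 6% of all label slots were misclassified.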

Keywords:

multi-label classification; Bukhari Hadith; feature-weighted; tf-idf; word2vec; hamming loss.
