Evaluation of Average Term Occurrences Weighting Technique for Arabic Textual Information Retrieval

Belal Mustafa Abuata (1), Lama Ali Al Omari (2)
(1) Information Technology, Yarmouk University, Irbid, 21163, Jordan
(2) Yarmouk University, Irbid, 21163, Jordan
How to cite (IJASEIT):
Abuata, Belal Mustafa, and Lama Ali Al Omari. “Evaluation of Average Term Occurrences Weighting Technique for Arabic Textual Information Retrieval”. International Journal on Advanced Science, Engineering and Information Technology, vol. 12, no. 6, Dec. 2022, pp. 2312-21, doi:10.18517/ijaseit.12.6.13215.
Document retrieval is an important task, and the vector space retrieval model relies on a term weighting scheme to match queries with documents. Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used term weighting scheme, and many studies have demonstrated its effectiveness in information retrieval. However, TF-IDF has drawbacks, such as retrieving irrelevant documents, which can reduce retrieval effectiveness. To address this, a term weighting scheme called Term Frequency with Average Term Occurrence (TF-ATO) was previously proposed and tested on English text to reduce the retrieval of irrelevant documents. In this paper, an information retrieval system is built for the Arabic language, and the Open Source Arabic Corpora (OSAC) are used for the experiments. Term weights are calculated using two schemes: traditional TF-IDF and the proposed TF-ATO. The results are then compared using evaluation measures. Across all queries, four case studies are implemented using two approaches (stop word removal and stemming). In the English experiments, stop word removal was applied together with another discriminative approach, which calculates the centroid of the documents. Analysis of the results shows that the proposed scheme is applicable to Arabic text and that the applied approaches enhance IR effectiveness when both are implemented. Furthermore, stop word removal has a favorable effect on both schemes, which was also observed in the English experiments.
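As an illustration of the two weighting schemes compared in the abstract, the following minimal Python sketch weights tokenized documents and scores them against a query with cosine similarity in a vector space model. It rests on assumptions not stated on this page: the function names are hypothetical, the TF-IDF variant uses raw term frequency with a natural-log IDF, and the TF-ATO weight is taken to be a term's frequency divided by the average number of occurrences per distinct term in the same document. The paper's exact definitions, its stop word list, and its stemmer may differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by raw term frequency times log(N / document frequency).

    docs is a list of token lists; returns one {term: weight} dict per document.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return weights


def tf_ato(docs):
    """Assumed TF-ATO form: term frequency divided by the document's
    average term occurrence (total tokens / number of distinct terms)."""
    weights = []
    for doc in docs:
        tf = Counter(doc)
        ato = len(doc) / len(tf) if tf else 1.0
        weights.append({t: f / ato for t, f in tf.items()})
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, already tokenized documents (stop words removed, terms stemmed).
docs = [["a", "a", "b"], ["b", "c"]]
query = {"a": 1.0}

# Rank documents for the query under each scheme.
for name, scheme in (("TF-IDF", tf_idf), ("TF-ATO", tf_ato)):
    scores = [cosine(query, vec) for vec in scheme(docs)]
    print(name, scores)
```

Note that, under this assumed form, TF-ATO needs no collection-level statistics such as document frequency, whereas the TF-IDF weights depend on the whole collection.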