Automatic Semantic Annotation of Indonesian Language Phrase Using N-Gram Language Model

Dewi Wardani (1), Chania Evangelista (2)
(1) Department of Data Science, Universitas Sebelas Maret, Jl Ir Sutami 36 A Surakarta, Indonesia
(2) Department of Informatics, Universitas Sebelas Maret, Jl Ir Sutami 36 A Surakarta, Indonesia
How to cite (IJASEIT) :
Wardani, Dewi, and Chania Evangelista. “Automatic Semantic Annotation of Indonesian Language Phrase Using N-Gram Language Model”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 5, Oct. 2024, pp. 1581-8, doi:10.18517/ijaseit.14.5.19499.
Building semantic data populations from unstructured text is challenging. Such data raises several problems, some of which are difficult to analyze. Some groups of words or expressions cannot be defined from the meanings of their individual words and can be a source of ambiguity: the same phrase can have a different meaning depending on the context of its use. This work aims to automatically annotate Indonesian-language text, especially phrases, using existing knowledge bases. The result is text with semantic markup, which machines can process automatically because the markup describes the meaning of the text. This work applies an n-gram language model to identify meaningful phrases and treat each one as a single unit, so that every word or phrase is automatically semantically tagged. The knowledge bases used are DBpedia and schema.org. The percentage of successfully labeled data in this work was 78% with 84.95% accuracy using DBpedia, and 5.9% with 97.46% accuracy using schema.org. Several factors affect the accuracy score, including whether the required data is available in the knowledge base, the system's ability in the POS tagging process, and the many new terms and local cultural concepts that the knowledge bases do not yet cover, especially schema.org, which is used as a standard across search engines. This work helps machines understand the semantics of text data: pages that are semantically tagged can be understood by machines, which supports subsequent processing.
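
The phrase-as-a-unit idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in a greedy longest-match over word n-grams against a small hand-written dictionary, rather than a trained n-gram language model, and it omits POS tagging. The dictionary `TOY_KB`, its entries, and the function names are assumptions made for illustration; a real system would query DBpedia or schema.org instead.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-in for a knowledge-base lookup (assumption: illustrative entries only;
# the paper's system uses the DBpedia and schema.org knowledge bases).
TOY_KB: Dict[str, str] = {
    "rumah sakit": "http://dbpedia.org/resource/Hospital",  # "hospital"
    "kereta api": "http://dbpedia.org/resource/Train",      # "train"
    "rumah": "http://dbpedia.org/resource/House",           # "house"
}


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous word n-grams of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def annotate(tokens: List[str], kb: Dict[str, str],
             max_n: int = 3) -> List[Tuple[str, Optional[str]]]:
    """Greedy longest-match tagging: at each position, try the longest
    n-gram first so a known phrase is tagged as one unit."""
    out: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            uri = kb.get(phrase)
            # Accept a match, or fall back to an untagged single word.
            if uri is not None or n == 1:
                out.append((phrase, uri))
                i += n
                break
    return out


def labeled_share(annotations: List[Tuple[str, Optional[str]]]) -> float:
    """Fraction of units that received a knowledge-base tag."""
    if not annotations:
        return 0.0
    return sum(1 for _, uri in annotations if uri is not None) / len(annotations)


if __name__ == "__main__":
    sentence = "Ibu pergi ke rumah sakit naik kereta api".split()
    print(ngrams(sentence, 2)[3])  # the bigram starting at "rumah"
    for unit, uri in annotate(sentence, TOY_KB):
        print(unit, "->", uri)
```

Trying the longest n-gram first is what keeps "rumah sakit" (hospital) from being tagged as "rumah" (house) followed by an untagged "sakit", which is the kind of phrase-level ambiguity the abstract describes. The `labeled_share` helper corresponds only loosely to the "percentage of successfully labeled data" reported above; how the paper computes that figure and its accuracy is not specified here.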

This work is licensed under a Creative Commons Attribution 4.0 International License.
