Development of Classification Method for Lecturer Area of Expertise Based on Scientific Publication Using BERT

Didi Rustam (1), Adang Suhendra (2), Suryadi Harmanto (3), Ruddy Suhatril (4), Dwi Fajar Saputra (5), Rusdan Tafsili (6), Rizki Prasetya (7)
(1) Department of Information Technology, Gunadarma University, Depok, West Java, Indonesia
(2) Department of Information Technology, Gunadarma University, Depok, West Java, Indonesia
(3) Department of Information Technology, Gunadarma University, Depok, West Java, Indonesia
(4) Department of Information Technology, Gunadarma University, Depok, West Java, Indonesia
(5) Department of Information Science, Universitas Pembangunan Nasional Veteran Jakarta, West Java, Indonesia
(6) Postgraduate Program in Learning Technology, Universitas Negeri Malang, East Java, Indonesia
(7) Postgraduate Program in Learning Technology, Universitas Negeri Malang, East Java, Indonesia
How to cite (IJASEIT):
Didi Rustam, et al. “Development of Classification Method for Lecturer Area of Expertise Based on Scientific Publication Using BERT”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 3, June 2024, pp. 894-905, doi: 10.18517/ijaseit.14.3.19893.
Artificial intelligence can support human resource (HR) talent management in higher education. The lecturer portfolio provided by the Integrated Resource Information System (SISTER DIKTI) is expected to give an overview of every lecturer's profile and to map competencies by groups of scientific fields. However, the map of scientific fields currently available from SISTER data is still subjective: it consists of fields of science that each lecturer selects independently to declare their own expertise. This study addresses the problem of processing unstructured SISTER data and develops a strategy for classifying groups of scientific fields from that unstructured input. The field of science a lecturer has chosen needs to be checked against the field actually developed through the tri-dharma (teaching, research, and community service); here, the tri-dharma is represented by research, i.e., the scientific publications recorded in SISTER. An appropriate model is therefore needed to measure the similarity of publication titles and abstracts and then classify them into scientific fields using NLP; classification was run on an NVIDIA DGX A100. Because the titles and abstracts of the publications in SISTER are unstructured data, a corpus was built to identify each lecturer's field of science. The results show that the classification method developed in this study can measure the similarity of a lecturer's publications from abstracts and titles through a vector-formation process based on bidirectional encoders, and that the resulting deep learning model classifies 24 categories of scientific fields with an accuracy of 95.0345% on the training data and 92.876% on the test data.
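For illustration, the sketch below shows one way the two steps described in the abstract could be realized with the Hugging Face transformers library: mean-pooled BERT vectors for measuring similarity between a lecturer's publications, and a 24-way sequence-classification head over the same bidirectional encoder. This is a minimal sketch under stated assumptions, not the authors' implementation: the checkpoint name bert-base-multilingual-cased, the mean-pooling scheme, and the example texts are all assumptions, and the paper's own corpus, preprocessing, and hyperparameters are not reproduced here.

```python
import torch
import torch.nn.functional as F
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoTokenizer)

# Assumption: a multilingual BERT checkpoint, since SISTER abstracts may be
# written in Indonesian or English; the paper does not name its checkpoint here.
MODEL_NAME = "bert-base-multilingual-cased"
NUM_FIELDS = 24  # the 24 categories of scientific fields from the abstract

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# --- Step 1: vector formation with the bidirectional encoder ---
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(text: str) -> torch.Tensor:
    """Encode a 'title + abstract' string into one vector by mean-pooling
    the final-layer token states (a common pooling choice, assumed here)."""
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)         # zero out padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, 768)

# Similarity between two (hypothetical) publications of one lecturer.
vec_a = embed("Deep learning for image segmentation. Abstract: ...")
vec_b = embed("CNN-based medical image analysis. Abstract: ...")
similarity = F.cosine_similarity(vec_a, vec_b).item()

# --- Step 2: a 24-way classification head over the same encoder ---
# The head is randomly initialized here and would still need fine-tuning on
# the labeled publication corpus before its predictions mean anything.
classifier = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_FIELDS
)
inputs = tokenizer("Title. Abstract: ...", truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = classifier(**inputs).logits                  # (1, 24)
predicted_field = logits.argmax(dim=-1).item()            # index of field group
```

Mean pooling is used here instead of the raw [CLS] vector because it tends to give more stable sentence similarities with off-the-shelf BERT checkpoints; the paper's own pooling choice is not stated in the abstract.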


This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors who publish with this journal agree to the following terms:

    1. Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
    2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
    3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).