Topic Modeling for Scientific Articles: Exploring Optimal Hyperparameter Tuning in BERT

Maresha Caroline Wijanto, Ika Widiastuti, Hwan-Seung Yong
Computer Science & Engineering, Department of Artificial Intelligence and Software, Ewha Womans University, Seoul, Republic of Korea
How to cite (IJASEIT) :
Wijanto, Maresha Caroline, et al. “Topic Modeling for Scientific Articles: Exploring Optimal Hyperparameter Tuning in BERT”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 3, June 2024, pp. 912-9, doi:10.18517/ijaseit.14.3.19347.
Topic modeling has emerged as a successful approach to uncovering topics from textual data. Various topic modeling techniques have been introduced, ranging from traditional algorithms to neural-network-based methods. In this research, we explore advanced topic modeling techniques, including BERT-based approaches, to enhance the analysis of scientific articles. We first investigate the widely used Latent Dirichlet Allocation (LDA) model and then explore the capabilities of BERT to automatically uncover latent topics within scientific papers. The goal of this study is to identify the optimal hyperparameter settings for BERT-based topic modeling of scientific articles. We conduct experiments across several scenarios involving combinations of word embedding, dimension reduction, and clustering methods. The results are analyzed based on coherence values, average execution time, number of topics generated, visualization through the inter-topic distance map, and the top-N words of each topic. Our findings suggest that the combination of RoBERTa for word embedding, PCA for dimension reduction, and K-Means for clustering yields superior results among the tested scenarios. Further evaluation of BERT-based topic modeling is necessary to validate these findings and explore its applications in various academic and industrial contexts. These advanced techniques could significantly streamline the process of staying current with the scientific literature, potentially revolutionizing research methodologies across disciplines.
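The best-performing pipeline described above (RoBERTa embeddings, then PCA for dimension reduction, then K-Means clustering, with each cluster treated as a topic) can be sketched as follows. This is a minimal illustration, not the authors' exact configuration: the embedding step is stood in by random 768-dimensional vectors (in practice these would come from a RoBERTa model, e.g. via the `sentence-transformers` library), and the component and cluster counts are arbitrary placeholders rather than the tuned hyperparameters the paper reports.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for RoBERTa document embeddings (768 dims is RoBERTa-base's
# hidden size); real usage would encode each article's text instead.
embeddings = rng.normal(size=(200, 768))

# Dimension reduction with PCA, as in the paper's best-performing scenario.
# n_components=5 is an illustrative choice, not a reported setting.
reduced = PCA(n_components=5, random_state=0).fit_transform(embeddings)

# Cluster the reduced vectors with K-Means; each cluster is one topic.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(reduced)
topics = kmeans.labels_  # topic assignment per document

print(reduced.shape, len(set(topics.tolist())))
```

Top-N words per topic would then typically be extracted by scoring terms within each cluster (e.g. class-based TF-IDF), and topic quality compared via coherence, as the study does across its embedding/reduction/clustering combinations.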



This work is licensed under a Creative Commons Attribution 4.0 International License.
