Topic Modeling for Scientific Articles: Exploring Optimal Hyperparameter Tuning in BERT

Topic modeling has emerged as a successful approach to uncovering topics from textual data. Various topic modeling techniques have been introduced, ranging from traditional algorithms to those based on neural networks. In this research, we explore advanced topic modeling techniques, including BERT-based approaches, to enhance the analysis of scientific articles. We first investigate the widely used Latent Dirichlet Allocation (LDA) model and then explore the capabilities of BERT to automatically uncover latent topics within scientific papers. The goal of this study is to identify the optimal hyperparameter setting for BERT-based topic modeling of scientific articles. We conduct experiments across several scenarios involving combinations of word embedding, dimension reduction, and clustering methods. The results were analyzed based on the coherence values, average execution time, number of topics generated, visualization through the inter-topic distance map, and the top-N words of each topic. Our findings suggest that the combination of RoBERTa for word embedding, PCA for dimension reduction, and K-Means for clustering yields superior results among the tested scenarios. Further evaluation of BERT-based topic modeling is necessary to validate these findings and explore its applications in various academic and industrial contexts. These techniques could streamline the process of staying up to date with scientific literature across disciplines.


I. INTRODUCTION
Topic modeling has proven to be a successful approach in extracting meaningful information from vast text corpora. By analyzing a collection of documents, topic modeling aims to uncover the underlying subjects present in the corpus, without the need for explicit supervision [1]. This technique enables efficient processing of large datasets while preserving the essential statistical relationships required for various tasks such as classification, novelty detection, summarization, ad hoc information retrieval, and analyzing historical documents. Moreover, topic modeling can compress extensive corpora into a brief summary by identifying and presenting the most frequently occurring topics as groups of linked terms.
In addition to its utility in processing large volumes of text, topic modeling also provides a comprehensible representation of documents, which finds applications in various Natural Language Processing (NLP) tasks [2]. The primary objective of topic modeling in NLP is to uncover topics, which are collections of words expressed as a mixture of closely related terms. Furthermore, each document is represented as a combination of one or more relevant themes. Topic modeling has proven valuable in understanding the diverse domains of science, particularly within the field of scientific publications. However, this task presents challenges due to the specificity and evolving nature of scientific documents over the past few decades [3].
Identifying the topics of scientific articles is instrumental in enabling researchers to track research trends and identify emerging areas of interest within their field [4], [5], [6], [7]. Moreover, it allows researchers to contextualize their own work within the broader landscape of their discipline and highlight how their work addresses critical questions or contributes to existing knowledge gaps [8], [9]. Building upon the work of others and integrating relevant findings into one's own research is a common practice for researchers [6], [10], [11]. Accurate topic identification in scientific articles facilitates efficient literature reviews, enabling researchers to swiftly locate relevant papers and stay up to date with the latest findings and developments, ultimately saving valuable time.
Topic models can be categorized into four types based on their underlying modeling techniques: algebraic, fuzzy, probabilistic, and neural [2]. Algebraic topic models such as Latent Semantic Analysis (LSA), developed in the 1990s [2], represent the corpus as a document-term matrix (DTM). Zengul et al. [12] state that LSA and topic modeling are among the most commonly employed methods. LSA is a natural language processing approach that examines associations between text-based terms and documents, assuming that words with similar meanings will occur in similar contexts. On the other hand, Latent Dirichlet Allocation (LDA) is a probabilistic topic model that represents a document as a vector of probabilities [2]. Several studies have used LDA to enhance classification quality, including one that combined LDA and SciBERT [4]. That study confirmed that adding topic modeling features can improve the quality of topical text classification in the scientific domain.
An empirical comparative study between LSA and LDA was conducted by [3] to investigate the impact of bi-gram collocation and lemmatization on both models. They found that LDA performs relatively better than LSA for topic numbers less than the optimal number reached (17 topics). Topic coherence was assessed in this study using both C_v and UMass metrics. The C_v performance of LSA decreases rapidly as the number of topics increases, while the C_v performance of LDA continues to rise until reaching a peak, after which it progressively declines. This study also found that lemmatization benefits C_v coherence when the number of topics is fewer than optimal. Additionally, comparisons between LSA and LDA have been made in terms of divergence, throughput, quality, and response time [13]. According to that study, LDA shows considerably greater accuracy than LSA. However, LSA's computing time is significantly less than LDA's.
Furthermore, a research study [14] reports that LDA is an effective tool for extracting features from text by determining the latent topics present in the collection of documents. However, while LDA is a powerful tool for topic modeling and feature extraction in text data, it faces challenges in optimizing the number of topics. Another study [15] states that LDA does not perform well if the number of topics 'k' is not adequately chosen, and that it struggles to identify topic correlations as well as topic evolution. Moreover, LDA exhibits inferior performance in inference compared to Non-negative Matrix Factorization (NMF) [16].
Recently, topic modeling approaches have seen a growing trend toward integrating neural components, leveraging contextualized representations instead of the traditional bag-of-words approach [17]. A prominent example is Bidirectional Encoder Representations from Transformers (BERT), which is designed to pretrain deep bidirectional representations from unlabeled text. BERT's ability to capture contextual semantics significantly improves the depth and accuracy of topic mining, overcoming limitations of traditional LDA, which might ignore such context [18]. Through its attention mechanisms, BERT can automatically form topical word clusters similar to those generated by LDA [19]. One study [20] utilized the BERT encoder model to encode sentences from textual documents to obtain positional embeddings of topic word vectors. Additionally, BERT has a smaller and faster version called DistilBERT [21], which is trained to mimic the behavior of the larger BERT model while being more computationally efficient and requiring fewer resources.
Additionally, BERT and LDA have shown successful applications in clustering tasks [22]. That study employed a hybrid model, combining the probabilistic topic assignment vector from the LDA model with the sentence vectors derived from the BERT model. Research by [23] utilized a combination of LDA and BERT, where LDA identified the most frequently discussed topics in the dataset and BERT classified the sentiment present. Furthermore, BERT embeddings have been successfully used to explore the evolution of topics in scientific publications [24]. In this application, LDA was used to create topics. Then the LDA probability value for each word in a topic was multiplied by the averaged tensor similarity using monolingual or multilingual BERT embeddings. Another study [25] employed a hybrid approach, integrating BERT with an incremental community detection algorithm. In this instance, BERT established semantic relations between words in different contexts, while graph mining techniques, supported by simple structural rules, enhanced the resulting topics. Another hybrid model combining BERT and LDA in topic modeling with dimensionality reduction has been thoroughly investigated [26]. Clustering algorithms are computationally complex, and their difficulty increases with the number of features. Hence, dimensionality reduction methods such as PCA, t-SNE, and UMAP are used. This framework demonstrates that clustering with dimensionality reduction can lead to more coherent topics.
A Robustly Optimized BERT Pretraining approach (RoBERTa) is a variant of BERT that modifies the original BERT pretraining procedure to enhance end-task performance [27]. Developed by Facebook AI, RoBERTa utilizes a significantly larger corpus and more training data, leading to improved language representations and enhanced generalization capabilities. Conversely, DistilRoBERTa is a streamlined version of RoBERTa designed to reduce model size and computational resources while maintaining competitive performance.
This study aims to identify the best hyperparameter settings for topic modeling of scientific articles. The primary focus is to tune the parameters of the topic modeling algorithm to ensure the most accurate representation of topics within scientific articles. The paper is organized as follows: Section II presents the data and methodology, Section III discusses the results and findings, and Section IV concludes the paper.

A. Dataset and Evaluation
This research utilized a dataset of research articles from Kaggle [28] containing 20,972 rows of data in English. Each row includes a title and abstract from a set of research articles. Additionally, there is a column describing the topic based on the actual information provided. The topics included in this dataset are Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, and Quantitative Finance. A detailed count of data per category is shown in Figure 1. Before applying any topic modeling techniques, the dataset underwent a standard preprocessing pipeline. This involved removing hyperlinks, special characters, numbers, and words of one character. We also converted the text to lowercase, as some of the methods we applied are case-sensitive. Stopwords were removed using a list from the NLTK (Natural Language Toolkit) library.
Tokenization was performed on the dataset, followed by lemmatization using NLTK's WordNet lemmatizer, which ensures only base words are used. The final preprocessing step involves converting texts into TF-IDF (term frequency-inverse document frequency) weights for the machine learning model. For the deep learning model, the preprocessing stopped after lemmatization.
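The cleaning steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the small STOPWORDS set is a hypothetical stand-in for the NLTK stopword list, tokenization is reduced to whitespace splitting, and lemmatization is omitted so that the example has no external dependencies.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for",
             "to", "is", "are", "we", "this", "with"}


def clean_text(text, stopwords=STOPWORDS):
    """Return the cleaned tokens of one title or abstract."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = text.lower()                                  # convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                # drop special characters and numbers
    tokens = text.split()                                # simplified whitespace tokenization
    # Drop one-character words and stopwords.
    return [t for t in tokens if len(t) > 1 and t not in stopwords]


if __name__ == "__main__":
    sample = "We propose BERT-based models in 2023: see https://example.com/x for a demo!"
    print(clean_text(sample))
```

In a fuller version, each surviving token would then be passed through a lemmatizer before TF-IDF weighting (for LDA) or embedding (for the BERT-based models).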
Topic coherence is an evaluation metric used to measure the quality and interpretability of topics generated by topic modeling algorithms. It aims to assess how semantically meaningful and coherent the identified topics are. Higher coherence scores indicate that the topics are more coherent and more representative of the corpus. Two commonly used methods for calculating topic coherence are C_v and UMass, both widely used in topic modeling research to evaluate and compare different algorithms and configurations.
The C_v coherence measures the coherence of topics by evaluating the pairwise word similarity among the most probable words in each topic. It estimates word co-occurrence with a Boolean sliding window over a reference corpus, builds a context vector of normalized pointwise mutual information (NPMI) values for each top word, and averages the cosine similarities between these vectors [29]. C_v coherence correlates well with human judgment regarding topic quality and interpretability. A higher C_v score indicates that the topics are more coherent and linguistically meaningful.
The UMass coherence, also known as the u_mass metric, is an alternative method for evaluating topic coherence [30]. Unlike C_v, UMass does not use sliding-window context vectors; it scores ordered pairs of top words by the log conditional probability of their document-level co-occurrence in the modeled corpus itself. This method provides a fast and efficient way to measure topic coherence without requiring an external reference corpus. However, it may be less correlated with human judgment than C_v coherence.
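To make the document-based idea behind UMass concrete, the sketch below scores one topic from document co-occurrence counts. It follows the commonly cited formulation in which each ordered pair of top words contributes log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents containing the given words. The function name and the toy corpus are illustrative assumptions, and library implementations may normalize or aggregate the pair scores differently.

```python
import math


def umass_coherence(top_words, documents, eps=1.0):
    """Sum of log((D(w_m, w_l) + eps) / D(w_l)) over ordered pairs l < m.

    top_words: topic words ordered from most to least probable.
    documents: iterable of token lists (the corpus the topics came from).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + eps) / denom)
    return score


if __name__ == "__main__":
    docs = [["a", "b"], ["a", "b"], ["a", "c"]]
    print(umass_coherence(["a", "b"], docs))  # words that often co-occur
    print(umass_coherence(["a", "c"], docs))  # words that rarely co-occur
```

Scores closer to zero indicate that the top words tend to appear in the same documents; more negative scores indicate weaker co-occurrence.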

B. Proposed Method
This section outlines the methodology employed for conducting the topic modeling experiments. These experiments involve the application of both traditional LDA and BERT-based models, including BERTopic, RoBERTa, and DistilRoBERTa. The overall architecture for this experiment is depicted in Figure 2. Our methodology encompasses data collection, preprocessing, and the application of various topic modeling techniques. As mentioned in the previous section, the dataset used in this research was obtained from Kaggle. Before proceeding with topic modeling, we performed a series of preprocessing steps to ensure the consistency and quality of the data.
To investigate the performance of various topic modeling techniques, we employed two distinct approaches: Latent Dirichlet Allocation (LDA) and BERT-based modeling. We implemented LDA using the Gensim library in Python and conducted a systematic grid search to identify the optimal number of topics (K) for achieving coherent and interpretable results. For the BERT-based approach, we used the BERTopic library and experimented with various settings to optimize its performance. Specifically, we varied the word embedding, dimension reduction, and clustering components of BERTopic.
We employed the inter-topic distance distribution method to visualize the generated topics, which provides an intuitive representation of the relationships between different topics. By visualizing these distance distributions, we can gain insights into the cohesion and separation among the identified topics, thereby enhancing the interpretability of the results.
Furthermore, we employed both the C_v and UMass coherence metrics to assess the quality of the generated topics. These measures provide valuable insights into the semantic coherence of the topics and their ability to represent distinct concepts within the dataset.
Overall, our proposed method integrates data collection, preprocessing, topic modeling using both LDA and BERTopic, visualization through inter-topic distance distribution, and topic coherence evaluation to analyze the topic modeling process comprehensively.

A. Experiment Scenario
The abundance of scientific articles across diverse fields necessitates efficient and interpretable methods for topic modeling. In this research, in addition to implementing the widely used topic modeling technique LDA, we also aim to uncover the most effective hyperparameter settings for topic modeling of scientific articles by leveraging the BERTopic model. Our approach's novelty lies in combining various word embedding, dimension reduction, and clustering methods to obtain nuanced and coherent topic representations.
We have integrated various word embedding methods within the BERTopic model to leverage the rich semantic information embedded in scientific articles. Our options include the default Sentence-BERT (S-BERT), RoBERTa, DistilRoBERTa, Gensim FastText, and paraphrase-MiniLM-L3-v2. Each word embedding method offers unique strengths in capturing context and meaning from textual data. We systematically explore the impact of these embeddings on topic modeling performance, aiming to identify the optimal choice for effectively distilling knowledge from scientific research.
As scientific articles often comprise large corpora, dimension reduction methods play a pivotal role in enhancing both the scalability and interpretability of topic modeling. We consider two prominent techniques: Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA). UMAP excels in preserving local structure, making it well suited to maintaining nuanced relationships within the data. In contrast, PCA offers an efficient approach to reducing dimensions, simplifying large datasets while retaining significant variance.
Through extensive experimentation, we investigate the impact of these dimension reduction strategies on topic modeling results, seeking the best approach for balancing computational efficiency and topic coherence. To unveil the underlying structures within scientific articles and facilitate topic segmentation, we deploy two clustering methods: K-Means and HDBSCAN. K-Means, a classic partitioning algorithm, seeks to divide data into K clusters. In contrast, HDBSCAN employs density-based clustering to identify clusters of varying shapes and sizes. By applying these clustering techniques to our BERTopic model, we aim to discover the optimal approach for grouping scientific articles into coherent topics that genuinely reflect their inherent thematic content.
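As an illustration of the partitioning idea behind K-Means, the following sketch implements plain Lloyd iterations in pure Python. It is not the implementation used in the experiments: the two-dimensional points stand in for reduced document embeddings, and seeding the centroids with the first k points is a simplifying assumption made to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=100):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive seeding (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical 2-D points standing in for reduced embeddings.
    pts = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
    labels, centroids = kmeans(pts, k=2)
    print(labels)
    print(centroids)
```

Because K is fixed in advance, the number of resulting topics is known before clustering, whereas a density-based method such as HDBSCAN determines the number of clusters from the data.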
Table 1 presents the details of our experiment scenario. Our experimental setup systematically explores the hyperparameter space defined by word embedding methods, dimension reduction techniques, and clustering algorithms across 18 scenarios. We assess the quality and coherence of the generated topics for each configuration. The goal is to identify the hyperparameter settings that yield the most interpretable and semantically meaningful topics, thereby equipping researchers with a robust and effective toolkit for exploring scientific articles. By combining state-of-the-art language representations, dimension reduction strategies, and clustering methodologies, our research strives to advance the field of topic modeling for scientific articles. The knowledge gained from this study has the potential to improve our ability to navigate the scholarly literature, which in turn facilitates knowledge discovery and accelerates scientific progress.

B. Experimental Results
When applying a topic modeling technique such as Latent Dirichlet Allocation (LDA), the number of topics (K) is a crucial hyperparameter that must be determined before model training. For each value of K, the coherence score is calculated to evaluate the coherence and semantic meaningfulness of the generated topics. The score serves as a guide to help researchers select the optimal number of topics. A higher coherence score for a particular value of K indicates that the topics are more coherent and represent distinct concepts within the dataset. We evaluate the coherence score for a total of 12 topics as shown in Figure 3. The experiments indicate that the highest coherence score is achieved with six topics, recording a score of 0.58701, while the second highest is achieved with two topics, scoring 0.58471; the difference is a mere 0.0023. Based on these results, we set six as the default number of topics for subsequent experiments.
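The selection procedure described above amounts to a one-dimensional grid search. The sketch below shows the pattern only; `select_num_topics` and `score_fn` are hypothetical names. In the setting of this study, `score_fn` would train an LDA model with K topics and return its coherence, whereas here a toy scoring function is used so that the example runs without any topic modeling library.

```python
def select_num_topics(candidate_ks, score_fn):
    """Score every candidate K and return (best_k, scores).

    score_fn: callable mapping K to a coherence value (higher is better).
    Ties are resolved in favor of the candidate that appears first.
    """
    scores = {k: score_fn(k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    # Toy stand-in for "train a model with K topics and measure coherence".
    def toy_score(k):
        return -(k - 6) ** 2

    best_k, scores = select_num_topics(range(2, 13), toy_score)
    print(best_k, scores[best_k])
```

When two candidates score almost identically, as with six and two topics above, the choice may also reasonably take into account how well K matches the known structure of the dataset.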
To visualize our LDA model, we employ a t-SNE plot, as recommended by Genender-Feltheimer for exploratory data analysis [31]. t-distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised machine learning algorithm introduced by van der Maaten and Hinton [32] to visualize high-dimensional data in a low-dimensional space [33]. Figure 4 displays the t-SNE plot of our LDA model configured with six topics. We attempt to map the generated topics to the categories from the initial dataset. In our analysis, these topics fit into four categories, despite the dataset initially having six. This reduction is consistent with the distribution of articles in the original categories, where the other two categories had very few articles.
As mentioned in the previous section, we conducted several experiments using Google Colab with GPU settings enabled. Each experiment was run five times across all scenarios, with the number of topics set to six. We compared the execution times when using data from abstracts or titles with various hyperparameter scenarios. Time-A indicates the execution time for abstract data, while Time-T represents the execution time for title data. The results suggest that, generally, the execution time for abstract data is longer than for title data due to the larger amount of text in abstracts. The shortest execution times were observed when using FastText, followed by S-BERT. It can also be concluded that, in general, UMAP requires a longer processing time than PCA.
The variation in execution time can be attributed to several factors. Gensim FastText is a lightweight and relatively simple word embedding model compared to transformer-based models like S-BERT, RoBERTa, and DistilRoBERTa. Transformer models are deep neural networks with multiple layers and many parameters, making them computationally more intensive during training and inference. In contrast, FastText uses a shallow neural network and character-level n-grams, resulting in a smaller model size and faster execution. Transformer-based models incorporate complex self-attention mechanisms that require more computation. Conversely, FastText employs a straightforward averaging mechanism, which is computationally less demanding and contributes to its quicker execution time.
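Averaging execution time over repeated runs, as done for each scenario, can be sketched with a small timing helper. The helper name `average_runtime` and the placeholder workload are assumptions for illustration; in practice the callable would fit one topic modeling configuration on the title or abstract data.

```python
import time
from statistics import mean


def average_runtime(run_once, repeats=5):
    """Call run_once() `repeats` times and return the mean wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return mean(durations)


if __name__ == "__main__":
    # Placeholder workload standing in for fitting one scenario.
    def workload():
        return sum(i * i for i in range(100_000))

    print(f"mean runtime over 5 runs: {average_runtime(workload):.4f} s")
```

Wall-clock measurements on a shared environment can fluctuate between runs, which is one reason to report an average rather than a single timing.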
This research focuses on topic modeling, and we also analyze the number of topics generated in each scenario. Specifically, we concentrate on scenarios using HDBSCAN as the clustering method. We do not consider K-Means because the number of topics formed is predetermined at six, as we manually set the number of clusters K. The results are shown in Table III. As in the time measurement experiments, these experiments also compare the number of topics generated using abstract data (# Topics-A) and title data (# Topics-T) as the inputs. UMAP is known for its ability to preserve both local and global data structures, making it particularly effective at capturing complex and nonlinear relationships. It strives to maintain the relative distances between data points, ensuring that similar points are clustered closely in the reduced space. Consequently, UMAP may produce more fine-grained topic representations and reveal more subtle differences. This capability might generate more topics as BERTopic attempts to encapsulate the increased diversity and nuance in the topic space.
On the other hand, PCA is a linear dimension reduction technique that focuses on capturing the directions of largest variance within the data. Although it is efficient at reducing dimensionality and can be computationally quicker, PCA may not preserve complex relationships and subtle differences between data points as effectively as UMAP. Therefore, PCA may yield more compact topic representations, which could result in a smaller number of generated topics in BERTopic.
This observation aligns with the results of the experiments conducted. Generally, using UMAP as the dimension reduction method results in a highly variable number of generated topics. In this research, such variability in topic numbers does not aid in effectively identifying the topics of scientific articles.
Subsequently, we measured the coherence values for all scenarios, as shown in Table IV. We used the C_v and u_mass coherence metrics. These values represent the average outcomes of the five experiments performed and indicate the coherence values for abstract data only. We achieved the best coherence values for both metrics when using HDBSCAN as a clustering method and UMAP for dimension reduction. However, PCA generally outperforms UMAP in terms of coherence. This pattern was also observed when using K-Means as a clustering method. Moreover, both clustering methods showed that S-BERT and RoBERTa yield better coherence values.
Overall, we conclude that the best coherence values are obtained using K-Means as a clustering method, RoBERTa as a word embedding technique, and PCA as a dimension reduction method. These findings are consistent with the results of previous experiments and are quite close to the coherence value of the LDA experiments.
Figure 5 displays the inter-topic distance map for our data using the RoBERTa-PCA-K-Means hyperparameter settings. The similar sizes of the circles indicate that the volume of data in each generated topic is relatively uniform. Additionally, the absence of overlapping circles suggests that the topics are well separated. An optimal topic model is characterized by large, non-overlapping bubbles that are evenly distributed across the chart. As with the LDA results, we have also mapped the generated topics to the original categories from the dataset. Our mapping aligns with the hierarchical relationship results, as shown in Figure 6. From this figure, it is evident that there are only three major topics. Notably, statistics and mathematics are in the same cluster. We then focused our BERTopic implementation exclusively on computer science data. Figure 7 displays the inter-topic distance map for this subset of data. Similar to the result from the entire dataset, the circles representing each topic are relatively evenly sized and do not overlap. For topics 0 through 5, the respective numbers of articles in each generated topic are 2211, 1662, 1367, 1261, 1126, and 967.
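The topic sizes reported above can be turned into shares to quantify how evenly the computer science articles are spread across the six topics. The short sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
# Article counts for topics 0 through 5, as reported in the text.
topic_sizes = [2211, 1662, 1367, 1261, 1126, 967]

total = sum(topic_sizes)
shares = [size / total for size in topic_sizes]

for topic_id, (size, share) in enumerate(zip(topic_sizes, shares)):
    print(f"topic {topic_id}: {size} articles ({share:.1%})")

# Ratio between the largest and smallest topic as a simple balance indicator.
print(f"total: {total}, largest/smallest: {max(topic_sizes) / min(topic_sizes):.2f}")
```

The counts sum to 8594 articles, with the largest topic holding roughly a quarter of them and the smallest a little over a tenth.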
Upon closer examination, each generated topic corresponds to a specific learning category within Computer Science. Table V displays the top N words for each topic alongside our estimated categories. Given that the fields within computer science often overlap, the top-N-word results listed in Table V are logical and coherent. Despite the diversity of existing documents, we can still construct a topic model.

IV. CONCLUSION
This paper proposes a BERT-based topic modeling approach for scientific articles. Our experiments primarily focused on exploring the hyperparameter tuning for these models while also assessing traditional methods. We found that combining RoBERTa for word embedding, PCA for dimension reduction, and K-Means for clustering yielded the best results across all experiments based on the inter-topic distance map, coherence values, and execution time. We also conducted a comparison specifically for computer science articles, and the results demonstrated a consistent trend with those from the broader scientific corpus. Thus, BERT-based models show promise as effective methods for topic modeling. Additionally, we discovered that more evaluation metrics are needed for this problem. Unlike traditional LDA methods, evaluation of BERT-based topic modeling results is still limited, so further exploration is needed.

Fig. 1 Amount of data per category

Fig. 3 Topic coherence value in LDA

Fig. 5 Inter-topic distance map

Fig. 7 Inter-topic distance map for Computer Science Articles

Table II displays the average execution time results.

TABLE V TOP-N-WORDS FOR COMPUTER SCIENCE DATA