Medical Record Document Search with TF-IDF and Vector Space Model (VSM)

Lukman Heryawan (1), Dian Novitaningrum (2), Kartika Rizqi Nastiti (3), Salsabila Nurulfarah Mahmudah (4)
(1) Department of Computer Science and Electronics, Universitas Gadjah Mada, Sekip Utara Bulaksumur, Yogyakarta 55281, Indonesia
(2) Master Program in Computer Science, Universitas Gadjah Mada, Sekip Utara Bulaksumur, Yogyakarta 55281, Indonesia
(3) Master Program in Computer Science, Universitas Gadjah Mada, Sekip Utara Bulaksumur, Yogyakarta 55281, Indonesia
(4) Master Program in Computer Science, Universitas Gadjah Mada, Sekip Utara Bulaksumur, Yogyakarta 55281, Indonesia
How to cite (IJASEIT):
Heryawan, Lukman, et al. “Medical Record Document Search With TF-IDF and Vector Space Model (VSM)”. International Journal on Advanced Science, Engineering and Information Technology, vol. 14, no. 3, June 2024, pp. 847-52, doi:10.18517/ijaseit.14.3.19606.
The volume of medical record documents grows over time, as does the variety of diseases and therapies they describe. Search over these records, however, has not become correspondingly effective or efficient: searches often take a long time and return results that are not necessarily what the user expected. This study addresses the problem by building a search model for medical record documents using the vector space model (VSM) and TF-IDF methods. The VSM method can retrieve documents that do not exactly match the search query entered by the user but are still expected to be relevant to the user's need. The model was built from the data in the FS_ANAMNESA and FS_DIAGNOSA columns. Preprocessing consisted of deleting blank lines, lowercasing, removing punctuation marks, HTML tags, stop words, and excess spaces between words, and normalizing misspelled words. A TF-IDF matrix was then formed from the frequency of occurrence of each word feature, and the similarity between the search query and each medical record document was calculated with the cosine similarity formula. The retrieval result contained all columns of the retrieved medical record documents, sorted by similarity value and limited to the 10 rows with the highest values. Evaluated on 1000 medical record documents with 20 search queries, the model gave an average precision of 0.548 and an average recall of 0.796.
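The pipeline described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The stop-word list, the typo map, the sample records, the concatenation of the two columns into one text, the smoothed IDF variant, and the function names are all assumptions made for illustration; only the column names FS_ANAMNESA and FS_DIAGNOSA, the preprocessing steps, the cosine similarity ranking, and the top-10 cutoff come from the abstract.

```python
import html
import math
import re
import string
from collections import Counter

# Hypothetical resources: the abstract does not list the stop words or typo rules used.
STOP_WORDS = {"dan", "yang", "di", "sejak"}
TYPO_MAP = {"demem": "demam"}
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Strip HTML tags, lowercase, drop punctuation, fix typos, drop stop words."""
    text = html.unescape(re.sub(r"<[^>]+>", " ", text)).lower()
    tokens = text.translate(PUNCT_TO_SPACE).split()  # split() also collapses extra spaces
    tokens = [TYPO_MAP.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector (term -> weight); terms outside the vocabulary are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: freq * idf[t] for t, freq in tf.items()}


def build_index(docs):
    """Compute IDF weights and document vectors from a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: one common variant, assumed here; the abstract does not specify one.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [tfidf_vector(doc, idf) for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, records, top_k=10):
    """Return up to top_k (record index, similarity) pairs, most similar first."""
    docs = [preprocess(r["FS_ANAMNESA"] + " " + r["FS_DIAGNOSA"]) for r in records]
    idf, vectors = build_index(docs)
    q = tfidf_vector(preprocess(query), idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    # Assumption: records with zero similarity are not returned.
    scored = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [(i, s) for s, i in scored[:top_k]]


def precision_recall(retrieved, relevant):
    """Precision and recall of one query given retrieved and relevant record ids."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented toy records, not data from the study.
records = [
    {"FS_ANAMNESA": "<p>Demem tinggi sejak 3 hari, batuk.</p>", "FS_DIAGNOSA": "Demam berdarah"},
    {"FS_ANAMNESA": "Nyeri dada dan sesak napas", "FS_DIAGNOSA": "Angina"},
    {"FS_ANAMNESA": "Batuk   berdahak, pilek", "FS_DIAGNOSA": "ISPA"},
]
results = search("demam batuk", records)
```

Averaging `precision_recall` over a set of test queries, each with a manually labeled list of relevant records, would give figures comparable in kind to the average precision and recall reported above. A sparse dictionary per document keeps the sketch dependency-free; a real system over thousands of records would more likely use a matrix library.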


This work is licensed under a Creative Commons Attribution 4.0 International License.
