Information Extraction of Compound-Protein Interaction from Scientific Paper using Machine Learning

Aulia Afriza (1), Muhammas Rheza Muztahid (2), Annisa (3), Wisnu Ananta Kusuma (4)
(1) Department of Computer Sciences, IPB University, Bogor, Indonesia
(2) Department of Computer Sciences, IPB University, Bogor, Indonesia
(3) Department of Computer Sciences, IPB University, Bogor, Indonesia
(4) Department of Computer Sciences, IPB University, Bogor, Indonesia; Tropical Biopharmaca Research Center
How to cite (IJASEIT):
Afriza, Aulia, et al. “Information Extraction of Compound-Protein Interaction from Scientific Paper Using Machine Learning”. International Journal on Advanced Science, Engineering and Information Technology, vol. 12, no. 2, Apr. 2022, pp. 550-6, doi:10.18517/ijaseit.12.2.13748.
Drug-target interaction (DTI) identification is an important step in drug discovery that aims to find compounds useful for treatment. DTI findings are reported mostly in databases and in the scientific literature, so retrieving them from papers requires an information extraction method. Abstracts of research papers contain many compound sentences. This study uses regular expressions to identify compound sentences, text mining for information extraction, and Bernoulli Naive Bayes for classification. The dataset is a collection of 3,000 abstracts split into 29,363 sentences. Sentences parsed by the regular expressions are matched using pattern matching and then passed through text pre-processing. The pre-processed sentences serve as the training dataset, and the model is evaluated with 10-fold cross-validation. The best average accuracy was 0.72 for Naive Bayes without regular expressions on compound sentences and 0.76 for Naive Bayes with regular expressions on single sentences. Furthermore, after applying feature selection to the compound-sentence data, we obtained an accuracy of 0.731 for the model without regular expressions and 0.7644 for the model with regular expressions.
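
The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the clause-boundary pattern, the toy sentences, the labels "interaction" and "other", and all function names are assumptions made for illustration, and the classifier is a small hand-written Bernoulli Naive Bayes with Laplace smoothing, not a library model. Feature selection is left out, and the cross-validation uses 4 folds only because the toy dataset has 8 sentences.

```python
import math
import re

# Hypothetical clause boundary: a comma or semicolon followed by a conjunction.
# The abstract does not give the patterns actually used in the paper.
CLAUSE_BOUNDARY = re.compile(
    r"\s*[,;]\s*(?:and|but|or|while|whereas)\s+", re.IGNORECASE
)


def split_compound(sentence):
    """Return the clauses of a sentence; a simple sentence yields one clause."""
    parts = CLAUSE_BOUNDARY.split(sentence.strip().rstrip("."))
    return [p.strip() for p in parts if p.strip()]


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class BernoulliNB:
    """Bernoulli Naive Bayes over word presence/absence, Laplace-smoothed."""

    def fit(self, docs, labels):
        self.vocab = set().union(*docs)
        self.classes = sorted(set(labels))
        self.log_prior, self.p_word = {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(labels))
            # P(word present | class) with add-one smoothing.
            self.p_word[c] = {
                w: (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)
                for w in self.vocab
            }
        return self

    def predict(self, doc):
        def score(c):
            s = self.log_prior[c]
            for w in self.vocab:  # absent words contribute log(1 - p)
                p = self.p_word[c][w]
                s += math.log(p if w in doc else 1.0 - p)
            return s

        return max(self.classes, key=score)


def cross_val_accuracy(docs, labels, k):
    """Mean accuracy over k folds; example i is assigned to fold i % k."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(docs)) if i % k != fold]
        test = [i for i in range(len(docs)) if i % k == fold]
        model = BernoulliNB().fit(
            [docs[i] for i in train], [labels[i] for i in train]
        )
        hits = sum(model.predict(docs[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / k


# Toy training data, invented for illustration only.
TRAIN = [
    ("Curcumin inhibits COX-2", "interaction"),
    ("Quercetin binds EGFR", "interaction"),
    ("Aspirin inhibits COX-1", "interaction"),
    ("Genistein binds estrogen receptor", "interaction"),
    ("Samples were collected in Bogor", "other"),
    ("The study was conducted in 2020", "other"),
    ("Results were analyzed statistically", "other"),
    ("Patients were recruited in the clinic", "other"),
]
DOCS = [tokenize(s) for s, _ in TRAIN]
LABELS = [y for _, y in TRAIN]

if __name__ == "__main__":
    model = BernoulliNB().fit(DOCS, LABELS)
    text = "Caffeine inhibits adenosine receptor, and samples were analyzed in the clinic."
    for clause in split_compound(text):
        print(clause, "->", model.predict(tokenize(clause)))
    print("mean accuracy:", cross_val_accuracy(DOCS, LABELS, k=4))
```

Splitting a compound sentence before classification lets each clause be labeled on its own, so an interaction statement is not diluted by an unrelated clause in the same sentence. On a real corpus the accuracy would depend heavily on the patterns and pre-processing chosen.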
