International Journal on Advanced Science, Engineering and Information Technology, Vol. 11 (2021) No. 6, pages: 2534-2542, DOI:10.18517/ijaseit.11.6.14345

Generation of a Synthetic Dataset for the Study of Fraud through Deep Learning Techniques

Marco Sánchez, Verónica Olmedo, Carlos Narvaez, Myriam Hernández, Luis Urquiza-Aguiar

Abstract

Fraud is any deliberate act that uses cunning, deception, or other unfair means to deprive someone of property or money. Fraud-related activities are growing rapidly and cause substantial economic losses every year. An adequate analysis of this phenomenon requires data that evidences fraudulent behavior. However, because such data are scarce and difficult to obtain, generating synthetic data for study is a viable option. We designed two text-generation algorithms to create a synthetic dataset that supports fraud analysis. Both algorithms rely on the Fraud Triangle Theory proposed by Donald R. Cressey; the first uses a Recurrent Neural Network (RNN) and the second a Long Short-Term Memory (LSTM) network. The generated datasets were analyzed from a semantic point of view and scored for readability and grammatical consistency. The results of this evaluation indicate that the LSTM-based data generation architecture performs better in sentence readability (efficiency greater than 70%) than the RNN-based one (less than 40%). With LSTM, it was possible to synthesize a comprehensive dataset related to the vertices of the fraud triangle, which will make it easier to investigate fraudulent actions linked to human behavior. In future work, we will present a fraud prediction system based on machine learning techniques.

Keywords:

Fraud triangle theory; machine learning; deep learning; LSTM; RNN.
