Soft Set Multivariate Distribution for Categorical Data Clustering

Iwan Tri Riyadi Yanto (1), Rohmat Saedudin (2), Sely Novita Sari (3), Mustafa Mat Deris (4), Norhalina Senan (5)
(1) Department of Information Systems, University of Ahmad Dahlan, Yogyakarta, Indonesia
(2) Department of Information Systems, Telkom University, Bandung, West Java, Indonesia
(3) Faculty of Civil Engineering and Planning, Institute Teknologi Nasional Yogyakarta, Indonesia
(4) Faculty of Applied Science and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Batu Pahat, Johor, Malaysia
(5) Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Batu Pahat, Johor, Malaysia
How to cite (IJASEIT) :
Yanto, Iwan Tri Riyadi, et al. “Soft Set Multivariate Distribution for Categorical Data Clustering”. International Journal on Advanced Science, Engineering and Information Technology, vol. 11, no. 5, Oct. 2021, pp. 1841-6, doi:10.18517/ijaseit.11.5.15420.
Clustering is the process of partitioning a large multivariate dataset into smaller groups. It has been applied with remarkable success in fields such as pattern recognition, segmentation, and statistics. Categorical data has no inherent distance measure, which makes it more challenging to cluster than numerical data. Categorical data can, however, be assumed to follow a multinomial distribution, so the standard parametric model used in latent class clustering, based on a product of independent multinomial distributions, can be applied. Meanwhile, the multi-valued attributes of categorical data can be decomposed into standard soft sets that together form a multi soft set. In this paper, a clustering technique based on soft set theory is proposed for categorical data through a multinomial distribution. The data are represented as a multi soft set in which every soft set has its own probability of being a member of a cluster, and each data object is assigned to the cluster for which its probability is highest. The proposed technique is evaluated using the Dunn index with regard to the number of clusters, and using response time. The experimental results show that the proposed technique has the lowest response time, with high stability, compared to the baseline techniques. The study also recommends a maximum number of clusters for use on real data.
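The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain latent class model (a product of independent multinomials per cluster) fitted with a generic EM loop, and the function names `to_multi_soft_set` and `cluster`, the smoothing constants, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def to_multi_soft_set(data):
    """Decompose each categorical attribute into 0/1 indicator columns,
    one column per attribute value (one block per attribute)."""
    data = np.asarray(data, dtype=object)
    blocks = []
    for j in range(data.shape[1]):
        values = sorted(set(data[:, j]))
        blocks.append(np.array([[int(v == a) for a in values]
                                for v in data[:, j]]))
    return blocks

def cluster(blocks, k, n_iter=100, seed=0):
    """Assumed EM loop for a product-of-multinomials mixture.
    Returns hard labels and the (n, k) membership probabilities."""
    rng = np.random.default_rng(seed)
    n = blocks[0].shape[0]
    resp = rng.dirichlet(np.ones(k), size=n)  # random initial memberships
    for _ in range(n_iter):
        # M-step: cluster weights and per-attribute value probabilities
        weights = resp.mean(axis=0) + 1e-12
        log_p = np.tile(np.log(weights), (n, 1))
        for S in blocks:
            theta = resp.T @ S + 1e-6          # smoothed counts, shape (k, m)
            theta /= theta.sum(axis=1, keepdims=True)
            log_p += S @ np.log(theta).T       # log-likelihood per cluster
        # E-step: normalize to membership probabilities
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    # each object goes to the cluster with the highest probability
    return resp.argmax(axis=1), resp

# Hypothetical toy data: 3 categorical attributes, two obvious groups
toy = [("red", "small", "yes")] * 4 + [("blue", "large", "no")] * 4
labels, probs = cluster(to_multi_soft_set(toy), k=2)
print(labels)
```

The indicator blocks play the role of the soft sets of a multi soft set, and the final `argmax` corresponds to assigning each object to the cluster with the highest probability. The paper's actual update rules and stopping criterion may differ from this sketch.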
