International Journal on Advanced Science, Engineering and Information Technology, Vol. 11 (2021) No. 5, pages: 1841-1846, DOI:10.18517/ijaseit.11.5.15420

Soft Set Multivariate Distribution for Categorical Data Clustering

Iwan Tri Riyadi Yanto, Rohmat Saedudin, Sely Novita Sari, Mustafa Mat Deris, Norhalina Senan

Abstract

Clustering partitions a large multivariate dataset into smaller groups. It has been applied with remarkable success in fields such as pattern recognition, segmentation, and statistics. Categorical data have no inherent distance measure, which makes them more challenging to cluster than numerical data. Categorical data can be assumed to follow a multinomial distribution, so a standard parametric model can be used: latent class clustering based on an independent product of multinomial distributions. Meanwhile, the multi-valued attributes of categorical data can be decomposed into standard soft sets, which together form a multi soft set. In this paper, a clustering technique based on soft set theory is proposed for categorical data through a multinomial distribution. The data are represented as a multi soft set in which every soft set has its own probability of being a member of a cluster. Each object is assigned to the cluster for which its membership probability is highest. The proposed technique is evaluated using the Dunn index with regard to the number of clusters and response time. The experimental results show that the proposed technique has the lowest response time with high stability compared to baseline techniques. This study also recommends a maximum number of clusters for implementation on real data.
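The general idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a generic EM-style latent class model with an independent product of multinomials, and uses one-hot indicator blocks as a stand-in for the multi soft set decomposition. The function names (`to_multi_soft_set`, `multinomial_cluster`), the Laplace smoothing, and the random initialization are all hypothetical choices made for illustration.

```python
import numpy as np

def to_multi_soft_set(data):
    """Decompose each categorical column into a 0/1 indicator block
    (one column per attribute value); an assumed stand-in for a multi soft set."""
    blocks = []
    for j in range(data.shape[1]):
        values, codes = np.unique(data[:, j], return_inverse=True)
        blocks.append(np.eye(len(values))[codes.ravel()])
    return blocks

def multinomial_cluster(data, k, n_iter=50, seed=0):
    """EM-style latent class clustering (illustrative, not the paper's method).
    Returns hard labels and the soft membership probabilities."""
    rng = np.random.default_rng(seed)
    blocks = to_multi_soft_set(data)
    n = data.shape[0]
    # Random initial memberships; each row sums to 1.
    resp = rng.dirichlet(np.ones(k), size=n)
    for _ in range(n_iter):
        # M-step: cluster mixing proportions.
        weights = resp.mean(axis=0)
        log_p = np.tile(np.log(weights + 1e-12), (n, 1))
        for b in blocks:
            # M-step: per-cluster value probabilities, Laplace-smoothed.
            theta = resp.T @ b + 1.0
            theta /= theta.sum(axis=1, keepdims=True)
            # Attributes are treated as independent, so log-probabilities add.
            log_p += b @ np.log(theta).T
        # E-step: normalize to membership probabilities (stabilized by max shift).
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    # Each object goes to the cluster with the highest probability.
    return resp.argmax(axis=1), resp
```

Calling `multinomial_cluster(data, k=2)` on a small array of string-valued attributes returns one label per row together with the membership matrix; rows with identical attribute values should end up in the same cluster.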

Keywords:

Clustering; categorical data; soft set; multivariate.
