Enhanced Manhattan-based Clustering using Fuzzy C-Means Algorithm for High Dimensional Datasets

Joven A. Tolentino (1), Bobby D. Gerardo (2)
(1) Technological Institute of the Philippines and Tarlac Agricultural University
(2) West Visayas State University
How to cite (IJASEIT):
Tolentino, Joven A., and Bobby D. Gerardo. “Enhanced Manhattan-Based Clustering Using Fuzzy C-Means Algorithm for High Dimensional Datasets”. International Journal on Advanced Science, Engineering and Information Technology, vol. 9, no. 3, May 2019, pp. 766-71, doi:10.18517/ijaseit.9.3.6005.
Mining high dimensional data incurs a high computational cost because such datasets are composed of thousands of attributes and/or instances. The efficiency of an algorithm, specifically its speed, is often sacrificed when this kind of dataset is supplied to it. The Fuzzy C-Means algorithm is one that suffers from this problem: it requires substantial computational resources whether it processes low or high dimensional data. The Netflix rating data, small round blue cell tumors (SRBCTs), and Colon Cancer datasets (52,308, and 2,000 attributes and 1,500, 83, and 62 instances, respectively) were identified as high dimensional datasets. As such, a Manhattan distance measure employing a trigonometric function was used to enhance the fuzzy c-means algorithm. Results show an increase in the efficiency of processing large amounts of data. On the Netflix, Colon Cancer, and SRBCT datasets, the Euclidean distance measure took 39,296, 38,952, and 85,774 milliseconds, respectively, to complete the different clusters, an average of 54,674 milliseconds, while the Manhattan distance measure took 36,858, 36,501, and 82,86 milliseconds, respectively, an average of 52,703 milliseconds for the entire dataset to cluster. The enhanced Manhattan distance measure took 33,216, 32,368, and 81,125 milliseconds, respectively, an average of 48,903 milliseconds. Given these results, the enhanced Manhattan distance measure is 11% more efficient than the Euclidean distance measure and 7% more efficient than the Manhattan distance measure.
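
As an illustration of how the distance measure plugs into fuzzy c-means, the following is a minimal sketch, not the authors' implementation. The function names, parameters, and the use of NumPy are assumptions made for this example. The abstract does not specify the trigonometric enhancement, so only the plain Manhattan and Euclidean measures are shown. The sketch also keeps the standard weighted-mean center update for both measures, which is a simplification when the Manhattan distance is used.

```python
import numpy as np


def pairwise_distance(X, centers, metric="manhattan"):
    """Distance from every row of X to every center, shape (n_samples, n_clusters)."""
    diff = X[:, None, :] - centers[None, :, :]
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    raise ValueError(f"unknown metric: {metric}")


def fuzzy_c_means(X, n_clusters, m=2.0, metric="manhattan",
                  max_iter=100, tol=1e-5, seed=0):
    """Illustrative fuzzy c-means with a selectable distance measure.

    Returns (centers, U), where U[i, k] is the membership of sample i
    in cluster k and every row of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalized per sample.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(max_iter):
        # Center update: membership-weighted mean (assumed for both metrics).
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]

        # Membership update: u_ik proportional to d_ik^(-2 / (m - 1)).
        D = np.maximum(pairwise_distance(X, centers, metric), 1e-12)
        inv = D ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)

        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break

    return centers, U
```

Calling `fuzzy_c_means(X, n_clusters=2, metric="euclidean")` on the same data gives a baseline against which the Manhattan variant can be timed. The Manhattan measure avoids the squaring and square root of the Euclidean measure, which is one plausible source of the speed difference reported above.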
