Performance Analysis of Big Data Clustering on Hadoop

Hadian Mandala Putra, Taufik Akbar, Ahwan Ahmadi, Muhammad Iman Darmawan


Big data is a collection of large, complex data that comprises many data types, comes from many sources, and grows rapidly. Processing it raises problems of storage and access, because data of such variety and complexity cannot be handled by the relational model. Hadoop is one technology that addresses these problems: it stores and processes big data by distributing it across several data partitions (data blocks). A difficulty arises when an analysis needs all of the distributed data as a single entity, as in data clustering. One alternative is to analyze the partitions in parallel where they reside and then run a centralized analysis on the partial results. This study examines and analyzes two methods, namely K-Medoids with MapReduce and K-Modes without MapReduce. The dataset is a car dataset of 3.5 million rows (400 MB) distributed across a Hadoop cluster of more than one machine. Hadoop's MapReduce feature consists of two functions, map and reduce. The map function selects data and returns a collection of key-value pairs; the reduce function then combines the key-value pairs produced by the map functions. Cluster quality is evaluated with the Silhouette Coefficient. On the car dataset, the K-Medoids MapReduce algorithm gives a silhouette value of 0.99 with 2 clusters.
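The map and reduce roles described above can be illustrated with a minimal, single-process sketch of one K-Medoids iteration. This is not the authors' implementation and does not use the Hadoop API: the function names, the Manhattan distance, and the toy numeric points are all assumptions made for illustration. Each inner list stands in for one data block.

```python
from collections import defaultdict


def distance(a, b):
    """Manhattan distance between two numeric tuples (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def map_assign(block, medoids):
    """Map step: for each point in one block, emit (nearest medoid index, point)."""
    for point in block:
        nearest = min(range(len(medoids)),
                      key=lambda i: distance(point, medoids[i]))
        yield nearest, point


def reduce_update(points):
    """Reduce step: choose the member with the smallest total distance to the rest."""
    return min(points, key=lambda c: sum(distance(c, p) for p in points))


def kmedoids_iteration(blocks, medoids):
    """Run map over every block, group pairs by key, then reduce each group."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for block in blocks:
        for key, point in map_assign(block, medoids):
            grouped[key].append(point)
    # Keep the old medoid if a cluster received no points.
    return [reduce_update(grouped[i]) if grouped[i] else medoids[i]
            for i in range(len(medoids))]


# Hypothetical toy data: two blocks, two initial medoids.
blocks = [[(1, 1), (2, 1), (9, 9)], [(1, 2), (8, 9), (9, 8)]]
medoids = [(2, 1), (8, 9)]
new_medoids = kmedoids_iteration(blocks, medoids)
print(new_medoids)
```

The reduce step here compares every pair of points in a cluster, which is quadratic in cluster size; at the scale of millions of rows a sampled or approximate medoid update would likely be needed.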


Big Data; Hadoop; MapReduce; Clustering








Copyright (c) 2021 Infotek : Jurnal Informatika dan Teknologi

This work is licensed under a Creative Commons Attribution 4.0 International License.
