Analisa Performa Klastering Data Besar pada Hadoop (Performance Analysis of Big Data Clustering on Hadoop)

Hadian Mandala Putra; Taufik Akbar; Ahwan Ahmadi; Muhammad Iman Darmawan

doi:10.29408/jit.v4i2.3565

Authors

Hadian Mandala Putra Universitas Hamzanwadi http://orcid.org/0000-0002-9807-9989
Taufik Akbar Universitas Hamzanwadi
Ahwan Ahmadi Universitas Hamzanwadi
Muhammad Iman Darmawan Universitas Hamzanwadi

Keywords:

Big Data, Hadoop, MapReduce, Clustering

Abstract

Big data is a large, complex collection of data of many types, drawn from many sources, that grows rapidly. Processing it raises problems of storage and access, because data this varied and complex cannot be handled by the relational model. Hadoop is one technology that addresses these problems: it stores and processes big data by distributing it across several data partitions (data blocks). A difficulty arises when an analysis needs all of the distributed data as a single entity, as data clustering does. One alternative is to run the analysis in parallel on the distributed partitions and then perform a centralized analysis of the partial results. This study examines and analyzes two methods, K-Medoids with MapReduce and K-Modes without MapReduce. The dataset describes cars and consists of 3.5 million rows (400 MB), distributed in a Hadoop cluster of more than one machine. Hadoop's MapReduce feature consists of two functions, map and reduce. The map function selects data and returns a collection of key-value pairs; the reduce function then combines the key-value pairs produced by the several map functions. Cluster quality is evaluated with the Silhouette Coefficient metric. On the car dataset, the K-Medoids MapReduce algorithm gives a silhouette value of 0.99 with 2 clusters.
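
The map and reduce roles described in the abstract can be illustrated with a small plain-Python sketch of one K-Medoids iteration over partitioned data. This is not the authors' implementation and does not use Hadoop: the function names (map_assign, reduce_update, kmedoids_step), the Manhattan distance, and the toy points are all assumptions chosen for illustration. Each data block is mapped independently to (cluster index, point) pairs, and a single reduce step then combines the pairs from all blocks to choose a new medoid per cluster.

```python
from collections import defaultdict


def distance(a, b):
    # Manhattan distance between two numeric tuples (an assumed metric).
    return sum(abs(x - y) for x, y in zip(a, b))


def map_assign(block, medoids):
    # Map: for every point in one data block, emit (nearest medoid index, point).
    return [
        (min(range(len(medoids)), key=lambda i: distance(p, medoids[i])), p)
        for p in block
    ]


def reduce_update(pairs):
    # Reduce: group points by cluster key, then choose as the new medoid the
    # member with the smallest total distance to all members of its group.
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    return {
        key: min(members, key=lambda c: sum(distance(c, m) for m in members))
        for key, members in groups.items()
    }


def kmedoids_step(blocks, medoids):
    # Run the map phase on each block separately, then one centralized reduce.
    pairs = []
    for block in blocks:
        pairs.extend(map_assign(block, medoids))
    updated = reduce_update(pairs)
    # A cluster that received no points keeps its previous medoid.
    return [updated.get(i, medoids[i]) for i in range(len(medoids))]


# Toy data split into three blocks, standing in for distributed partitions.
blocks = [[(1, 1), (2, 1)], [(1, 2), (10, 10)], [(11, 10), (10, 11)]]
print(kmedoids_step(blocks, [(2, 1), (11, 10)]))
```

In a real Hadoop job the map calls would run on the nodes that hold each block and the framework would route pairs to reducers by key; the loop in kmedoids_step only stands in for that distribution.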

Published

31-07-2021

How to Cite

Mandala Putra, H., Akbar, T., Ahmadi, A., & Iman Darmawan, M. (2021). Analisa Performa Klastering Data Besar pada Hadoop. Infotek: Jurnal Informatika Dan Teknologi, 4(2), 174–183. https://doi.org/10.29408/jit.v4i2.3565

Issue

Vol. 4 No. 2 (2021): Infotek : Jurnal Informatika dan Teknologi

Section

Articles

License

All articles in this journal are the full responsibility of their authors. Jurnal Infotek provides open access to everyone so that the information and findings in its articles can benefit all readers. The journal can be accessed and downloaded free of charge, in accordance with the Creative Commons license used.