Text Classification Using Genetic Programming with an Implementation of Web Scraping and Map Reduce

Wirarama Wedashwara; Andy Hidayat; Budi Irmawati; Ariyan Zubaidi

doi:10.29408/edumatic.v6i1.5274

Authors

Wirarama Wedashwara, Informatics Engineering Study Program, Universitas Mataram https://orcid.org/0000-0002-3716-1620
Andy Hidayat, Informatics Engineering Study Program, Universitas Mataram https://orcid.org/0000-0002-8272-957X
Budi Irmawati, Informatics Engineering Study Program, Universitas Mataram https://orcid.org/0000-0002-4178-1486
Ariyan Zubaidi, Informatics Engineering Study Program, Universitas Mataram https://orcid.org/0000-0002-2139-9749

Keywords:

genetic programming, text classification, map reduce

Abstract

Classifying text documents from online media is a big data problem that requires automation. This research developed a text classification system that uses web scraping for data collection and map-reduce for pre-processing. The study aims to evaluate text classification performance when genetic programming, map-reduce, and web scraping are combined to process large volumes of text. Data were collected by web scraping, during which 8126 duplicates were removed. Map-reduce performed tokenization and stop-word removal, yielding 28507 terms, of which 4306 were unique and 24201 were duplicates. The classification evaluation shows that the single tree achieved the highest accuracy (0.7072), followed by the decision tree (0.6874) and the multi-tree (0.6726). For the support values obtained by genetic programming, the order differs: the multi-tree had the highest average support (0.3854), followed by the decision tree (0.3584) and the single tree (0.3494). In general, higher support did not correspond to higher accuracy.
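The map-reduce pre-processing described in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the stop-word list, the whitespace tokenizer, the sample documents, and the names `map_tokens`, `reduce_counts`, and `term_statistics` are all assumptions made for illustration. It shows only how tokenization, stop-word removal, and the split of terms into unique and duplicate counts fit together.

```python
from collections import Counter
from functools import reduce

# Hypothetical stop-word list; the list used in the study is not given.
STOP_WORDS = {"the", "a", "of", "and", "is"}


def map_tokens(document):
    """Map step: tokenize one document and count its non-stop-word terms."""
    tokens = document.lower().split()  # assumed whitespace tokenizer
    return Counter(t for t in tokens if t not in STOP_WORDS)


def reduce_counts(left, right):
    """Reduce step: merge two partial term counts into one."""
    return left + right


def term_statistics(documents):
    """Run map then reduce, and summarize total, unique, and duplicate terms."""
    counts = reduce(reduce_counts, map(map_tokens, documents), Counter())
    total = sum(counts.values())
    unique = len(counts)
    return {"total": total, "unique": unique, "duplicates": total - unique}


# Made-up sample documents, not data from the study.
docs = ["The price of rice is rising", "Rice price and fuel price"]
stats = term_statistics(docs)
print(stats)
```

Under this reading, the duplicate count is the total number of terms minus the number of unique terms, which is consistent with the reported figures: 28507 - 4306 = 24201.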
