Text Classification Using Genetic Programming with an Implementation of Web Scraping and Map Reduce
DOI: https://doi.org/10.29408/edumatic.v6i1.5274
Keywords: genetic programming, text classification, map reduce
Abstract
Classifying text documents from online media is a big data problem that requires automation. This research developed a text classification system that collects data by web scraping and pre-processes it with map-reduce. The study aims to evaluate text classification performance when genetic programming, map-reduce, and web scraping are combined to process large volumes of text. Data were collected by web scraping, with 8126 duplicates reduced during collection. Map-reduce performed tokenization and stop-word removal, yielding 28507 terms, of which 4306 are unique and 24201 are duplicates. The evaluation shows that a single tree achieves higher accuracy (0.7072) than a decision tree (0.6874), and the multi-tree scores lowest (0.6726). For genetic programming support values, the multi-tree has the highest average support at 0.3854, followed by the decision tree at 0.3584, and the single tree has the smallest at 0.3494. In general, a higher support value does not correspond to a higher accuracy.
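The pre-processing step described in the abstract (duplicate removal, then tokenization and stop-word removal in a map-reduce style, then counting unique and duplicate terms) can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the stop-word list, the function names (`map_phase`, `reduce_phase`, `term_statistics`), and the sample documents are all assumptions made for illustration, and a real deployment would presumably run on a distributed framework rather than in-memory Python.

```python
from collections import Counter
from functools import reduce

# Hypothetical stop-word list; the abstract does not say which list was used.
STOP_WORDS = frozenset({"the", "a", "of", "and", "in", "is"})


def map_phase(document):
    """Map: tokenize one document and emit (term, 1) pairs, skipping stop words."""
    tokens = document.lower().split()
    return [(tok, 1) for tok in tokens if tok.isalpha() and tok not in STOP_WORDS]


def reduce_phase(counts, pair):
    """Reduce: add the count emitted for a term to its running total."""
    term, n = pair
    counts[term] += n
    return counts


def term_statistics(documents):
    """Return total, unique, and duplicate term counts for a document list."""
    # dict.fromkeys drops verbatim duplicate documents while keeping order.
    unique_docs = list(dict.fromkeys(documents))
    pairs = [pair for doc in unique_docs for pair in map_phase(doc)]
    counts = reduce(reduce_phase, pairs, Counter())
    total = sum(counts.values())
    unique = len(counts)
    return {"total": total, "unique": unique, "duplicates": total - unique}


# Illustrative sample data, not taken from the study.
docs = [
    "the economy is growing",
    "the economy is growing",  # scraped twice, removed before mapping
    "growing interest in sports and economy",
]
stats = term_statistics(docs)
print(stats)
```

Under this counting scheme the duplicate terms are simply the total minus the unique terms, which matches the figures reported in the abstract: 4306 unique terms plus 24201 duplicate terms gives 28507 terms in total.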
License
All articles in this journal are the full responsibility of their authors. Edumatic: Jurnal Pendidikan Informatika can be accessed free of charge, in accordance with the Creative Commons license used.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.