Digital Transformation of Journal Metadata Management: An Experimental Study of OCR-Based Data Entry Automation

Authors

Laurensia, A., Seniwati, E., & Pristyanto, Y.

DOI:

https://doi.org/10.29408/edumatic.v9i3.32631

Keywords:

accuracy, efficiency, data entry, journal metadata, OCR

Abstract

Manual entry of journal metadata is time-consuming and error-prone, which can hinder indexing and reduce the efficiency of editorial workflows. Optical character recognition (OCR) offers an automated alternative that may accelerate this process, yet its performance on densely structured journal metadata has not been extensively examined. This study evaluates the time efficiency and accuracy of OCR-based metadata entry and compares them with manual entry. Metadata from the arxiv-metadata-oai-snapshot dataset were rendered into visual documents and processed under three workload scenarios of 100, 500, and 1,000 entries. Two metrics were analyzed: processing time and record accuracy. The results reveal a substantial time difference between the two workflows: in the 1,000-entry scenario, manual entry required approximately 50,000 seconds, whereas OCR completed the task in only 0.075 seconds. Manual accuracy remained stable at 88%, while the automated approach achieved 97%, although character-level accuracy on visual documents ranged from only 1% to 3%, reflecting the complexity of journal metadata structures. These findings indicate that OCR can serve effectively as an initial stage of metadata automation but still requires human verification through a Human-in-the-Loop approach to maintain data integrity.
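
The abstract names its metrics but does not define them. The sketch below shows one plausible way such metrics could be computed; it is not the authors' implementation. It assumes that record accuracy is the share of entries reproduced exactly, and that character-level accuracy is one minus the edit distance normalized by the reference length. The function names and sample strings are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, extracted: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not extracted else 0.0
    return max(0.0, 1.0 - levenshtein(reference, extracted) / len(reference))


def record_accuracy(references: Sequence[str], extracted: Sequence[str]) -> float:
    """Assumed definition: fraction of entries that match the reference exactly."""
    if len(references) != len(extracted):
        raise ValueError("reference and extracted lists must have equal length")
    if not references:
        return 0.0
    matches = sum(r == e for r, e in zip(references, extracted))
    return matches / len(references)


def seconds_per_entry(total_seconds: float, n_entries: int) -> float:
    """Average processing time per entry for a workload scenario."""
    return total_seconds / n_entries


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the study.
    truth = ["Deep Learning", "Graph Theory", "Quantum Optics", "Data Mining"]
    ocr = ["Deep Learning", "Graph Theory", "Quantum 0ptics", "Data Mining"]
    print(record_accuracy(truth, ocr))
    print(char_accuracy(truth[2], ocr[2]))
    # Figures reported in the abstract for the 1,000-entry scenario.
    print(seconds_per_entry(50_000, 1_000))
    print(seconds_per_entry(0.075, 1_000))
```

Applied to the figures reported for the 1,000-entry scenario, the per-entry averages are 50 seconds for manual entry and about 75 microseconds for OCR.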

Published

2025-12-08

How to Cite

Laurensia, A., Seniwati, E., & Pristyanto, Y. (2025). Transformasi Digital Pengelolaan Metadata Jurnal: Studi Eksperimental Otomasi Entri Data berbasis OCR. Edumatic: Jurnal Pendidikan Informatika, 9(3), 815–824. https://doi.org/10.29408/edumatic.v9i3.32631