Evaluating English Language Test Items Developed by Teachers: An Item Response Theory Approach
DOI: https://doi.org/10.29408/veles.v9i1.27644
Keywords: item response theory, English test, item difficulty, English language assessments
Abstract
Evaluating students’ abilities in educational settings is crucial for assessing learning outcomes and instructional effectiveness. In Indonesia, many schools have developed local English language assessments, yet these tests often lack psychometric validation. This study aims to evaluate the quality of a teacher-developed English language test instrument using the Item Response Theory (IRT) approach. A total of 25 multiple-choice items created by the English teacher group in Muna Regency were administered to 162 students from five randomly selected schools. A descriptive quantitative method was employed with the aid of R Studio for data analysis. Initial sample adequacy was confirmed using the Kaiser-Meyer-Olkin measure (KMO = 0.686) and Bartlett’s Test of Sphericity (p < .001). The study applied model fit analyses for the 1-PL, 2-PL, and 3-PL logistic models, with the 2-PL model emerging as the most appropriate, as 16 items demonstrated good fit. Further analysis of item characteristics under the 2-PL model revealed that only 11 items had acceptable difficulty and discrimination indices, while the remaining 14 items were either too easy, too difficult, or poorly discriminating. These results indicate that a substantial portion of the test requires revision. The study highlights the importance of psychometric evaluation in teacher-made assessments and recommends capacity-building for teachers in test development and validation practices.
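The 2-PL model applied in the abstract models the probability of a correct response from an item's difficulty (b) and discrimination (a) parameters. The sketch below illustrates that formula; it is written in Python for illustration only (the study itself used R Studio), and the function name and parameter values are hypothetical, not taken from the study's data.

```python
import math

def two_pl_probability(theta: float, a: float, b: float) -> float:
    """Probability of a correct response under the 2-PL IRT model:
    P(theta) = 1 / (1 + exp(-a * (theta - b))),
    where theta is examinee ability, a is item discrimination,
    and b is item difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the probability is 0.5,
# regardless of the discrimination parameter.
p_mid = two_pl_probability(theta=0.5, a=1.2, b=0.5)

# For an examinee above the item's difficulty, a higher-discrimination
# item yields a higher probability of success, i.e. it separates
# examinees more sharply around b.
p_low_a = two_pl_probability(theta=1.5, a=0.5, b=0.5)
p_high_a = two_pl_probability(theta=1.5, a=2.0, b=0.5)
```

In this framing, the "too easy" or "too difficult" items flagged in the study are those whose estimated b falls far outside the ability range of the examinees, and the "poorly discriminating" items are those with a near (or below) zero.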
References
Aiken, L. R. (1985). Three coefficients for analysing the reliability and validity of ratings. Educational and Psychological Measurement, 45(1), 131–142. https://doi.org/10.1177/0013164485451012
Akram, M., & Zepeda, S. J. (2015). Development and validation of a teacher self-assessment instrument. Journal of Research and Reflections in Education, 9(2).
Ali, L. (2018). The design of curriculum, assessment, and evaluation in higher education should be done with constructive alignment. Journal of Education and e-Learning Research, 5(1), 72–78. https://doi.org/10.20448/journal.509.2018.51.72.78
Areekkuzhiyil, S. (2021). Issues and concerns in classroom assessment practices. Edutracks, 20(8), 20-23.
Ary, D., Jacobs, L. C., Irvine, C. K. S., & Walker, D. (2019). Introduction to research in education (10th ed.). Cengage Learning.
Baker, F. B. (2001). The basics of item response theory (2nd ed.). ERIC Clearinghouse.
Baldonado, A. A., Svetina, D., & Gorin, J. (2015). Using necessary information to identify item dependence in Passage-Based Reading Comprehension tests. Applied Measurement in Education, 28(3), 202–218. https://doi.org/10.1080/08957347.2015.1042154
Banta, T. W., & Palomba, C. A. (2015). Assessment essentials: Planning, implementing, and improving assessment in higher education (2nd ed.). Jossey-Bass/Wiley.
Bichi, A. A., & Talib, R. (2018). Item Response Theory: An introduction to latent trait models for test and item development. International Journal of Evaluation and Research in Education (IJERE), 7(2), 142. https://doi.org/10.11591/ijere.v7i2.12900
Borg, S., & Edmett, A. (2018). Developing a self-assessment tool for English language teachers. Language Teaching Research, 23(5), 655–679. https://doi.org/10.1177/1362168817752543
Brown, H. D. (2005). Language assessment: Principles and classroom practices. Pearson Education.
Brown, G. T. L., & Abdulnabi, H. H. A. (2017). Evaluating the quality of higher education instructor-constructed multiple-choice tests: Impact on student grades. Frontiers in Education, 2. https://doi.org/10.3389/feduc.2017.00024
Care, E., Griffin, P., & McGaw, B. (2012). Assessment and teaching of 21st century skills (pp. 17-66). Dordrecht, The Netherlands: Springer.
Carlson, J.E., & von Davier, M. (2017). Item Response Theory. In: Bennett, R., von Davier, M. (eds) Advancing human assessment. Methodology of educational measurement and assessment. Springer, Cham. https://doi.org/10.1007/978-3-319-58689-2_5
Creswell, J. W. (2014). Research design: Qualitative, quantitative, and mixed methods approaches (4th ed.). SAGE Publications.
Darmawan, N. M., Sudarsono, Riyanti, N. D., Yuliana, N. Y. G. S., & Sumarni, N. (2022). Test items analysis of the English teacher-made test. Journal of English Education and Teaching, 6(4), 498–513. https://doi.org/10.33369/jeet.6.4.498-513
de Ayala, R. J. (2009). The theory and practice of item response theory. Guilford Press.
DeLuca, C., & Bellara, A. (2013). The current state of assessment education. Journal of Teacher Education, 64(4), 356–372. https://doi.org/10.1177/0022487113488144
Durán, R. P. (2008). Assessing English-language learners’ achievement. Review of Research in Education, 32(1), 292–327. https://doi.org/10.3102/0091732x07309372
Earl, L. M. (2013). Assessment as learning: Using classroom assessment to maximise student learning (2nd ed.). Corwin Press.
Effendi, T., & Mayuni, I. (2022). Examining a teacher-made English test in a language school. LADU Journal of Languages and Education, 2(2), 67–76. https://doi.org/10.56724/ladu.v2i2.109
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates Publishers.
English, F. W. (2010). Deciding what to teach and test: Developing, aligning, and leading the curriculum (3rd ed.). Corwin Press.
Embretson, S., & Yang, X. (2006). Item response theory. In J. L. Green, G. Camilli, & P. B. Elmore (Eds.), Handbook of complementary methods in education research (pp. 385–409). Mahwah, NJ: Lawrence Erlbaum Associates.
Gashaye, S., & Degwale, Y. (2019). The content validity of high school English language teacher-made tests: The case of Debre Work Preparatory School, East Gojjam, Ethiopia. International Journal of Research in Engineering, IT and Social Sciences, 9(11), 41-50.
George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference (4th ed.). Allyn & Bacon.
Griffin, P., & Care, E. (2014). Assessment and teaching of 21st century skills: Methods and approaches. Springer.
Haladyna, T.M., & Rodriguez, M.C. (2013). Developing and validating test items (1st ed.). Routledge. https://doi.org/10.4324/9780203850381
Hartell, E., & Strimel, G. J. (2018). What is it called and how does it work: Examining content validity and item design of teacher-made tests. International Journal of Technology and Design Education, 29(4), 781–802. https://doi.org/10.1007/s10798-018-9463-2
Hirpassa, M. (2018). Content validity of EFL teacher-made assessment: The case of communicative English skills course at Ambo University. East African Journal of Social Sciences and Humanities, 3(1), 41–62.
Immekus, J. C., Snyder, K. E., & Ralston, P. A. (2019). Multidimensional item response theory for factor structure assessment in educational psychology research. Frontiers in Education, 4, 45. https://doi.org/10.3389/feduc.2019.00045
Karim, S. A., Sudiro, S., & Sakinah, S. (2021). Utilising test items analysis to examine the level of difficulty and discriminating power in a teacher-made test. EduLite Journal of English Education Literature and Culture, 6(2), 256. https://doi.org/10.30659/e.6.2.256-269
Kasman, K., & Lubis, S. K. (2022). Teachers’ performance evaluation instrument designs in the implementation of the new learning paradigm of the Merdeka curriculum. Jurnal Kependidikan: Jurnal Hasil Penelitian Dan Kajian Kepustakaan Di Bidang Pendidikan, Pengajaran Dan Pembelajaran, 8(3), 760. https://doi.org/10.33394/jk.v8i3.5674
Kissi, P., Baidoo-Anu, D., Anane, E., & Annan-Brew, R. K. (2023). Teachers’ test construction competencies in an examination-oriented educational system: Exploring teachers’ multiple-choice test construction competence. Frontiers in Education, 8. https://doi.org/10.3389/feduc.2023.1154592
Lee, Y. (2019). Estimating student ability and problem difficulty using item response theory (IRT) and TrueSkill. Information Discovery and Delivery, 47(2), 67–75. https://doi.org/10.1108/idd-08-2018-0030
Lord, F.M. (1980). Applications of item response theory to practical testing problems (1st ed.). Routledge. https://doi.org/10.4324/9780203056615
Maharani, A. V., & Putro, N. H. P. S. (2020). Item analysis of the English final semester test. Indonesian Journal of EFL and Linguistics, 5(2), 491. https://doi.org/10.21462/ijefl.v5i2.302
Metsämuuronen, J. (2022). Seeking the real item difficulty: Bias-corrected item difficulty and some consequences in Rasch and IRT modelling. Behaviormetrika, 50(1), 121–154. https://doi.org/10.1007/s41237-022-00169-9
McTighe, J., & Ferrara, S. (2021). Assessing student learning by design: Principles and practices for teachers and school leaders. Teachers College Press.
Nkansah, B. K. (2018). On the Kaiser-Meyer-Olkin’s measure of sampling adequacy. Mathematical Theory and Modeling, 8(7), 52–76.
Nitko, A. J., & Brookhart, S. M. (2014). Educational assessment of students (7th ed.). Pearson Education.
Odukoya, J. A., Adekeye, O., Igbinoba, A. O., & Afolabi, A. (2017). Item analysis of university-wide multiple choice objective examinations: The experience of a Nigerian private university. Quality & Quantity, 52(3), 983–997. https://doi.org/10.1007/s11135-017-0499-2
Osterlind, S. J. (2006). Modern measurement: Theory, principles, and applications of mental appraisal. Pearson Education.
Pastore, S. (2023). Teacher assessment literacy: A systematic review. Frontiers in Education, 8. https://doi.org/10.3389/feduc.2023.1217167
Reeve, B. (2023). Item Response Theory [IRT]. In: Maggino, F. (eds) Encyclopedia of quality of life and well-being research. Springer, Cham. https://doi.org/10.1007/978-3-031-17299-1_1556
Setiawati, F. A., Izzaty, R. E., & Hidayat, V. (2018). Items' parameters of the space-relations subtest using item response theory. Data in Brief, 19, 1785–1793. https://doi.org/10.1016/j.dib.2018.06.061
Sharma, P. (2015). Standards-based assessments in the classroom. Contemporary Education Dialogue, 12(1), 6–30. https://doi.org/10.1177/0973184914556864
Shaw, S., Crisp, V., & Johnson, N. (2012). A framework for evidencing assessment validity in large-scale, high-stakes international examinations. Assessment in Education: Principles, Policy and Practice, 19(2), 159–176. https://doi.org/10.1080/0969594x.2011.563356
Sun, J. C. Y., Wu, Y. T., & Lee, W. I. (2017). The effect of the flipped classroom approach to OpenCourseWare instruction on students’ self‐regulation. British Journal of Educational Technology, 48(3), 713-729. https://doi.org/10.1111/bjet.12444
Sundqvist, P., Wikström, P., Sandlund, E., & Nyroos, L. (2017). The teacher as examiner of L2 oral tests: A challenge to standardisation. Language Testing, 35(2), 217–238. https://doi.org/10.1177/0265532217690782
Sweeney, S. M., Sinharay, S., Johnson, M. S., & Steinhauer, E. W. (2022). An investigation of the nature and consequences of the relationship between IRT difficulty and discrimination. Educational Measurement Issues and Practice, 41(4), 50–67. https://doi.org/10.1111/emip.12522
Vatterott, C. (2015). Rethinking grading: Meaningful assessment for standards-based learning. ASCD.
Wauters, K., Desmet, P., & Van Den Noortgate, W. (2010). Adaptive item‐based learning environments based on the item response theory: possibilities and challenges. Journal of Computer Assisted Learning, 26(6), 549–562. https://doi.org/10.1111/j.1365-2729.2010.00368.x
Wiliam, D. (2011). What is assessment for learning? Studies in Educational Evaluation, 37(1), 3–14. https://doi.org/10.1016/j.stueduc.2011.03.001
Wilson, M. (2023). Constructing measures: An item response modelling approach (2nd ed.). Routledge. https://doi.org/10.4324/9781003286929
Wuntu, N. V. L. E. S. C. (2021). Analysis of teacher-made tests used in summative evaluation at SMP Negeri 1 Tompaso. Zenodo (CERN European Organisation for Nuclear Research). https://doi.org/10.5281/zenodo.5775342
Young, V. M., & Kim, D. H. (2010). Using assessments for instructional improvement: A literature review. Education Policy Analysis Archives, 18, 19. https://doi.org/10.14507/epaa.v18n19.2010
Zanon, C., Hutz, C. S., Yoo, H. H., & Hambleton, R. K. (2016). An application of item response theory to psychological test development. Psicologia: Reflexão e Crítica, 29. https://doi.org/10.1186/s41155-016-0040-x
Zieky, M. J. (2016). Fairness in test design and development. In: Dorans, N. J., & Cook, L. L. (Eds). Fairness in educational assessment and measurement (pp. 9–31). Routledge.
License
Copyright (c) 2025 Rezkilaturahmi Rezkilaturahmi, Muhammad Istiqlal, Nur Hidayanto Pancoro Setyo Putro, Edi Istiyono, Widihastuti

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with the VELES Journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
- Authors are able to enter into separate, additional contractual arrangements for the distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
VELES Journal is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.