Evaluating English Language Test Items Developed by Teachers: An Item Response Theory Approach

Authors

  • Rezkilaturahmi, Universitas Negeri Yogyakarta
  • Muhammad Istiqlal, UIN Salatiga
  • Nur Hidayanto Pancoro Setyo Putro, Universitas Negeri Yogyakarta
  • Edi Istiyono, Universitas Negeri Yogyakarta
  • Widihastuti, Universitas Negeri Yogyakarta

DOI:

https://doi.org/10.29408/veles.v9i1.27644

Keywords:

item response theory, English test, item difficulty, English language assessments

Abstract

Evaluating students’ abilities in educational settings is crucial for assessing learning outcomes and instructional effectiveness. In Indonesia, many schools have developed local English language assessments, yet these tests often lack psychometric validation. This study evaluates the quality of a teacher-developed English language test instrument using the Item Response Theory (IRT) approach. A total of 25 multiple-choice items created by the English teacher group in Muna Regency were administered to 162 students from five randomly selected schools. A descriptive quantitative method was employed, with data analysed in RStudio. Sampling adequacy was confirmed using the Kaiser-Meyer-Olkin measure (KMO = 0.686) and Bartlett’s Test of Sphericity (p < .001). Model fit analyses were applied for the 1-PL, 2-PL, and 3-PL logistic models, with the 2-PL model emerging as the most appropriate, as 16 items demonstrated good fit. Further analysis of item characteristics under the 2-PL model revealed that only 11 items had acceptable difficulty and discrimination indices, whereas the remaining 14 items were either too easy, too difficult, or poorly discriminating. These results indicate that a substantial portion of the test requires revision. The study highlights the importance of psychometric evaluation in teacher-made assessments and recommends capacity-building for teachers in test development and validation practices.
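The 2-PL model referenced above assigns each item a difficulty parameter (b) and a discrimination parameter (a). A minimal sketch of the 2-PL item response function and an item-flagging rule is shown below; the cut-offs used (difficulty within ±2 logits, discrimination of at least 0.65) are common rules of thumb assumed here for illustration, not the thresholds reported in the paper:

```python
import math

def p_correct(theta, a, b):
    """2-PL item response function: probability that a student with
    ability theta answers the item correctly, given discrimination a
    and difficulty b. When theta == b, the probability is exactly 0.5."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def classify(a, b):
    """Flag an item against rule-of-thumb ranges (assumed values:
    difficulty b in [-2, 2], discrimination a >= 0.65)."""
    flags = []
    if b < -2:
        flags.append("too easy")
    elif b > 2:
        flags.append("too difficult")
    if a < 0.65:
        flags.append("poorly discriminating")
    return flags or ["acceptable"]

# A well-behaved item versus a hard, weakly discriminating one.
print(classify(1.2, 0.5))   # acceptable
print(classify(0.3, 3.0))   # too difficult and poorly discriminating
```

In practice, the parameters themselves are estimated from response data rather than set by hand; in R, packages such as `ltm` or `mirt` are commonly used for fitting 1-PL, 2-PL, and 3-PL models.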

Author Biographies

Muhammad Istiqlal, UIN Salatiga

Mathematics Education Department

Nur Hidayanto Pancoro Setyo Putro, Universitas Negeri Yogyakarta

English Language Education Department

Edi Istiyono, Universitas Negeri Yogyakarta

Educational Research and Evaluation

Widihastuti, Universitas Negeri Yogyakarta

Educational Research and Evaluation


Published

2025-04-29

How to Cite

Rezkilaturahmi, Istiqlal, M., Pancoro Setyo Putro, N. H., Istiyono, E., & Widihastuti. (2025). Evaluating English Language Test Items Developed by Teachers: An Item Response Theory Approach. Voices of English Language Education Society, 9(1), 218–230. https://doi.org/10.29408/veles.v9i1.27644
