The Implementation of Automated Speech Recognition (ASR) in ELT Classroom: A Systematic Literature Review from 2012-2023

This study systematically reviews the literature on the use of Automated Speech Recognition (ASR) in English Language Teaching (ELT) classrooms, focusing on its application in speaking assessments. ASR technology, which converts speech audio into text, offers significant advantages in teaching pronunciation and evaluating learners' spoken language. It opens avenues for interactive aural engagements between students and computers. This research aims to scrutinize the effectiveness of ASR in speaking assessment, emphasizing the validity of automated scores and addressing potential technological inaccuracies or limitations. Despite existing research on technology's role in speaking evaluation, there remains a paucity of empirical studies that systematically review ASR's educational applications. This gap underscores the need for an up-to-date, comprehensive exploration of ASR's implementation in education. For this review, ten studies from 2012 to 2023 were selected from renowned academic databases, including Taylor & Francis, Wiley, and Springer. The findings highlight ASR's educational benefits, such as enhancing student progress, fostering interaction, and offering significant pedagogical contributions. Notably, the integration of ASR in speaking assessments facilitates a synergistic relationship between human and automated scoring, enriching the assessment process. This study serves as a valuable resource for educators seeking to optimize speaking assessments in ELT classrooms through ASR technology. It sheds light on the methodological advancements and pedagogical implications of integrating ASR, providing a foundation for future research and practical applications in language education.


INTRODUCTION
The exploration of Automated Speech Recognition (ASR) technology has formed a substantial part of recent academic inquiry. ASR, a technology that converts spoken language into text, has seen remarkable advancements in recent years (Ivanov et al., 2015). This progression has led to its extensive application in various commercial systems, including travel reservations, financial services, and weather forecasting, as highlighted by Godwin (2009). Despite its technological advancements, ASR's integration into English Language Teaching (ELT) has been met with mixed reactions. The field has witnessed the emergence of numerous commercial English language learning software products that often fall short of their ambitious promises. This inconsistency in efficacy has led to a turbulent reputation for ASR within the ELT community. Historically, ASR has been a component of computer-assisted language learning (CALL) systems, also referred to as "voice-interactive CALL" (Carrier, 2017). These systems primarily utilized ASR for pronunciation training and initiating dialogic interactions, underscoring its potential in language education. Beyond pronunciation aid, ASR offers broader applications in analyzing learners' speech, fostering opportunities for interactive auditory exchanges between students and computer systems.
The transformative aspect of ASR lies in its ability to convert speech audio into written text, thus serving as a critical tool for teaching pronunciation (Inceoglu, 2020). Its use extends to evaluating learners' speech on a wider scale and establishing foundations for auditory interactions. ASR facilitates language learning by enabling programs to "listen" to learners' pronunciation, offering formative assessment and feedback on phonological accuracy. A key aspect of ASR's role in ELT is its contribution to pedagogical practices. It enables instructors to effectively assess spoken English in classrooms, bridging the gap between technology and teaching. The technology's impact on pedagogy extends to how educators perceive its benefits and limitations in second language (L2) instruction. ASR's capability to offer direct feedback and pronunciation assessment simplifies the teaching process, making it an invaluable tool in modern language education.
The assessment of English speaking skills through Automated Speech Recognition (ASR) and Spoken Dialogue Systems (SDS) has been a focus of numerous studies in recent years, as evidenced by works such as Laughlin et al. (2020), Bashori et al. (2022), Gu et al. (2020), and Koizumi (2020). ASR, forming the basis of interactive SDS, incorporates both speech and natural language processing technologies to enable comprehensive human-machine interactions. This integration has proven instrumental in providing students with targeted and meaningful speaking practice. A notable trend over the past decade is the accelerating advancement in the use of ASR and SDS for assessing English speaking skills. Earlier, the application of these technologies in language assessment was limited, with ASR being oriented more towards computer-assisted language learning (CALL) than towards evaluative grading, and SDS focused on development in specific application domains (Litman et al., 2018). However, their role in speaking assessments has significantly expanded recently.
Several studies, such as those by Laughlin et al. (2020), have highlighted the effectiveness of using automated voice recognition and natural language processing in engaging students in productive speaking practice, particularly in flipped classroom models. Research by Bashori et al. (2022) demonstrated that ASR-based websites substantially aid students in enhancing their receptive vocabulary, showcasing the technology's capability to improve both vocabulary and pronunciation. Despite these advancements, ASR's journey in English Language Teaching (ELT) has been tumultuous. Godwin (2009) notes its widespread application in various commercial systems, yet its implementation in ELT has often been plagued by overpromising and underdelivering commercial language learning software. Earlier applications of ASR in CALL, as described by Carrier (2017), were primarily focused on pronunciation training and initiating dialogues rather than comprehensive language assessment. Moreover, the reliability of ASR in ELT suffered due to high rates of false positives and negatives, leading to frustrating learning experiences. However, recent technological advancements have significantly improved recognition quality and accuracy, renewing interest and expanding possibilities in language learning and assessment.
Despite extensive research on Automated Speech Recognition (ASR) and Spoken Dialogue Systems (SDS) in language education, a significant gap persists. A key issue is the frequent development of technology-based speech recognition tools without adequate teacher involvement. The complexity of ASR is amplified by factors such as background noise, low signal-to-noise ratio, pronunciation variations, and the continuous nature of speech, which are particularly challenging when processing non-native speech (Litman et al., 2018; Cucchiarini et al., 2010). Moreover, studies have primarily focused on constrained speech tasks, such as reading aloud or elicited imitation, which do not fully capture the nuances of spontaneous speech. Attempts to assess more natural, less controlled speech have been made (e.g., Cucchiarini et al., 2002; Zechner et al., 2009), but these present greater challenges for speech technology, resulting in increased errors. The calibration of task difficulty also remains a concern, as overly simple tasks may not generate sufficient errors for meaningful analysis. In addition, the existing literature offers little systematic research on evaluating English as a Foreign Language (EFL) speaking skills with advanced speech technologies such as ASR and SDS, especially in diverse educational settings and among students with varying abilities, including those with disabilities. This gap highlights the need for an in-depth examination of ASR's role in education through systematic literature reviews.
In addressing the identified gaps, this study embarks on a comprehensive and updated systematic literature review of Automated Speech Recognition (ASR) implementation in educational contexts. Central to this investigation are two critical research objectives. Firstly, the study seeks to unravel the current research status of ASR in education. This involves a thorough exploration of existing literature to gauge the depth and breadth of ASR's application in various educational settings. Secondly, the study aims to uncover the benefits and practical implications of applying ASR technology in the realm of language assessment. Special emphasis is placed on its impact on enhancing students' speaking skills. By achieving these objectives, the study intends to offer valuable insights for educators and students considering the use of ASR in speaking assessments. It is particularly focused on evaluating the balance between human and automated scoring, assessing their validity, and identifying potential challenges or errors associated with the use of ASR technology in language assessment.

METHOD
To conduct a rigorous systematic review on the topic of Automated Speech Recognition (ASR) in education, we followed the three-step framework proposed by Mcquade (2021) and Tranfield et al. (2003). This approach consists of planning the review, conducting the review, and reporting and disseminating the findings. In the initial stage of our research, we established the need and objectives for this systematic literature review. This was accomplished by conducting a scoping study in the relevant field of speech technology in education, which helped us identify the key areas that required in-depth exploration. This stage was critical in laying the groundwork and defining the scope of our review.
Next, our focus shifted towards the selection of high-quality research papers. We conducted an extensive keyword search across several acclaimed academic databases, including Taylor & Francis, Wiley Library, and Springer. The selection of these databases was strategic; our aim was to source authoritative research articles that were suitable for in-depth analysis and to ensure that our review would yield representative and reliable conclusions. To guide our selection, we employed specific inclusion criteria, focusing on papers that mentioned key terms such as "ASR," "speech technology," "technology in speaking assessment," "pronunciation," "speaking proficiency," "oral communication," and "automated speech recognition." Moreover, we imposed temporal and linguistic boundaries on our search, limiting it to papers published within the last decade (2012-2022) and ensuring that all selected articles were written in English.
To ensure a comprehensive capture of articles relevant to our study, we utilized a set of representative search terms that accurately reflected the essence of our research. These keywords were organized into two main clusters: "Automated Speech Recognition" and "Secondary School." The inclusion of related terms such as "middle school," "high school," and "children" was crucial for gathering information pertinent to our target age groups. Our initial search, using the string "Automated Speech Recognition" AND (middle school OR high school OR secondary school OR primary school OR elementary school OR kindergarten OR preschool OR children OR child), was conducted within the Social Sciences Citation Index (SSCI) of the chosen databases. This search yielded a total of 20 papers. We then examined the titles, abstracts, and methods sections of these papers against our predefined inclusion criteria. Through this screening, we narrowed down our selection to 10 papers that met our standards for further in-depth analysis.
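The search and screening steps described above can be illustrated with a minimal sketch. It assumes a hypothetical record format (a dictionary with `year` and `language` keys) and hypothetical helper names (`build_query`, `passes_basic_criteria`); only the search terms themselves are taken from the text. The year bounds are passed in by the caller, and the content-based screening of titles, abstracts, and methods sections is a manual step that this sketch does not attempt to reproduce.

```python
# Terms taken from the search string reported in the Method section.
CORE_TERM = "Automated Speech Recognition"
SETTING_TERMS = [
    "middle school", "high school", "secondary school", "primary school",
    "elementary school", "kindergarten", "preschool", "children", "child",
]


def build_query(core_term, alternatives):
    """Combine one quoted core term with an OR-group of setting terms."""
    or_group = " OR ".join(alternatives)
    return f'"{core_term}" AND ({or_group})'


def passes_basic_criteria(record, first_year, last_year, language="English"):
    """Check the temporal and linguistic boundaries for one record.

    `record` is a hypothetical dict such as {"year": 2020, "language": "English"}.
    """
    in_period = first_year <= record["year"] <= last_year
    return in_period and record["language"] == language


if __name__ == "__main__":
    print(build_query(CORE_TERM, SETTING_TERMS))
```

Keeping the term clusters in lists makes it easy to report the exact string used and to rerun the same search in each database.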

RQ1: What is the current research status of ASR in the ELT Classroom?
This section summarizes the current state of research on Automated Speech Recognition (ASR) in ELT and speaking classrooms, looking at the publication year and the educational level of the selected studies.

Figure 1. Publication year of ASR papers
The distribution of the selected empirical studies, as illustrated in Figure 1, offers a revealing perspective on the recent trends in research related to ASR in the educational sector. Among the 10 chosen studies, there is a noticeable concentration of publications post-2014, underscoring a growing academic interest in this field. This trend is particularly evident with the emergence of three articles in 2022 alone, signaling a significant surge in well-founded empirical research in the last five years. This increase can be attributed to advancements in technology, particularly in areas that require computers to interpret, understand, and respond to user voice commands. These developments have spurred a heightened focus on enhancing human-machine interaction. A critical aspect of this burgeoning research area, as noted by Kanabur et al. (2019), is the understanding of human speech production and the development of accurate speech recognition models. This focus is not only fundamental to the technological advancement of ASR systems but also pivotal to their effective application in educational settings. Given the rapid progression in empirical research and the expanding potential for ASR's applicability in education, it is anticipated that the coming years will witness a continued upsurge in evidence-based research in this domain. Such advancements are poised to further transform the landscape of educational technology, particularly in the realm of language learning and assessment.
Figure 2. The educational level of ASR papers
A significant number of studies within the realm of Automated Speech Recognition (ASR) in English Language Teaching (ELT) classrooms have predominantly centered on secondary and higher education students. This trend points to an imbalanced and often overlooked distribution across different age cohorts in ASR research. The emphasis on integrating ASR technologies in advanced educational settings, such as universities or sophisticated language courses, has highlighted the need for a deeper comprehension of these technologies and their practical applications in language learning environments. Students in higher education typically have more advanced expectations regarding the refinement and effectiveness of ASR tools. This demand necessitates the development of more intricate features and functionalities within ASR systems to adequately meet their language learning needs (Knill et al., 2018). Furthermore, the use of ASR in these settings often requires educators to possess a specialized understanding of how to effectively incorporate this technology into their teaching methods. This situation underscores the importance of tailored professional development programs designed to equip educators with the necessary skills and knowledge for successfully integrating ASR technology in higher education environments. Understanding the unique requirements and challenges inherent at this educational level is crucial. It provides critical insights into how ASR technology can be optimized for use in higher education contexts, ensuring that it meets the specific needs of both learners and educators. Recognizing and addressing these aspects is essential for the effective assimilation and utilization of ASR in advanced educational settings, ultimately contributing to more effective and engaging language learning experiences.

RQ2: What are the benefits of applying ASR in education?
The application of Automated Speech Recognition (ASR) in English Language Teaching (ELT) classrooms offers a range of significant benefits.

| Author(s) | Categories | Sub-Categories |
|  | Similar Stuttering Audio | The evaluation of ASR systems using ASTER is accurate and reliable. |
|  | Bug Identification | Categorize recognition errors in ASR systems into five bug types. Identify specific areas of improvement for ASR systems. |
| Tao, J., Evanini, K., & Wang, X. (2014) | Enhanced automated scoring | Calculate more precise scores based on the extracted features. |
|  | Efficient assessment process | More reliable and consistent assessment of students' spoken language proficiency. Reduce the need for manual transcription and scoring. Automatically transcribe and score spoken responses. |
|  | Improved accuracy | Accurately transcribe speech from nonnative speakers. |

The integration of Automatic Speech Recognition (ASR) technology into language learning has brought about a host of significant benefits, transforming traditional language instruction into a more interactive, self-paced, and personalized experience. ASR tools have significantly enhanced students' progress by offering immediate feedback and personalized learning paths. This technology facilitates interactive opportunities between students, teachers, and learning materials, contributing to improved pronunciation skills and vocabulary acquisition. Studies like those by Ahn & Lee (2015) and Bashori et al. (2022) have shown how ASR-based language learning systems, such as I Love Indonesia (ILI) and NovoLearning (NOVO), significantly improve vocabulary and pronunciation skills among Indonesian secondary school students. These advancements have led to increased student engagement and motivation, making language learning more appealing and effective.
The adaptability and robustness of ASR technology, including its ability to handle noise and separate context in encoding, have improved the accuracy of pronunciation assessment and feedback. This precision allows for more efficient evaluation and tailored interventions. ASR's role in enhancing generalization through diverse neural network analyses has also enabled a deeper understanding of students' proficiency levels, leading to customized teaching approaches (Chen, 2022). Scenario-based tasks have further encouraged student engagement through contextualized language learning, providing significant opportunities for authentic speaking practice.
Furthermore, research by Gu et al. (2020) demonstrates the value of computer-based speaking tests in providing diagnostic feedback for test preparation, such as for the TOEFL iBT examination. Additionally, studies by Koizumi (2022) and Timpe-Laughlin et al. (2020) have explored the pedagogical perspective, investigating teachers' perceptions of the benefits and limitations of ASR technology in L2 instruction and its implementation in specific teaching contexts. However, despite its potential, the implementation of ASR systems has encountered challenges, particularly in recognizing and transcribing stuttering speech patterns. Initiatives like the Assessment of Stuttering Speech with Error Recognition (ASTER) aim to provide accurate evaluations and insights into the recognition of errors, leading to targeted improvements.
In addition, the early attempts to apply Automatic Speech Recognition (ASR) to language acquisition encountered mixed results, primarily due to the use of systems designed for native speakers and other purposes, such as dictation. Early studies by Coniam (1999) and Derwing et al. (2000) highlighted these challenges. The difficulty in ASR for native speech arises from factors like background noise, low signal-to-noise ratio (SNR), disfluencies, pronunciation variation, and the blending of terms in speech. However, ASR for learners' speech, particularly in second language (L2) learning, poses even greater challenges. The grammar, vocabulary, and pronunciation in L2 speech often deviate significantly from native speech norms, impacting all three "knowledge sources" of the ASR system. L2 speech also contains more disfluencies and hesitation phenomena, which vary with the learner's proficiency level (Cucchiarini et al., 2010).
These disparities between first language (L1) and L2 speech significantly impair ASR performance. As noted by Van (2001) and Van et al. (2010), the performance issues in early ASR applications were more due to the inappropriate use of systems rather than inherent flaws in ASR technology itself. Recognizing this, professionals in the field have emphasized the need to tailor ASR systems to the specific needs of language learners and to integrate student-specific materials in the learning process. The core of ASR systems lies in training language and acoustic models. These systems utilize a lexicon and a vast corpus of audio files as input. The language model calculates probabilities for individual words and their sequences, while acoustic models represent the sounds of a language. The lexicon bridges the acoustic and language models, providing both phonological and orthographic transcriptions for each entry. Lexicons often include multiple entries for words to account for different pronunciation variants.
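To make the roles of the lexicon and the language model more concrete, the following toy sketch may help. It is an illustration under stated assumptions, not the design of any system reviewed here: the lexicon entries, phone symbols, training sentences, and function names are all hypothetical, the language model is a simple unsmoothed bigram model, and the acoustic model is omitted entirely.

```python
from collections import Counter

# Hypothetical toy lexicon: orthographic form -> pronunciation variants.
# The phone symbols are illustrative and not taken from a real dictionary.
LEXICON = {
    "either": [["IY", "DH", "ER"], ["AY", "DH", "ER"]],
    "read": [["R", "IY", "D"], ["R", "EH", "D"]],
    "the": [["DH", "AH"], ["DH", "IY"]],
}


def words_for_phones(phones):
    """Return every word whose lexicon entry lists this phone sequence."""
    return sorted(w for w, variants in LEXICON.items() if list(phones) in variants)


def train_bigram(sentences):
    """Count contexts and word pairs for an unsmoothed bigram model."""
    contexts = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        contexts.update(tokens[:-1])             # words that have a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams


def sequence_probability(sentence, contexts, bigrams):
    """Multiply P(word | previous word) over the sentence (0.0 if unseen)."""
    tokens = ["<s>"] + sentence.split()
    probability = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        probability *= bigrams[(prev, cur)] / contexts[prev]
    return probability
```

In this sketch the lexicon maps either pronunciation variant of a word back to the same spelling, while the bigram model assigns higher probability to word sequences that resemble its training text. Learner utterances that deviate from that text receive low or zero probability, which is one way to picture why systems trained on native speech struggle with L2 input.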
Once trained, ASR systems can process new speech samples, employing various knowledge sources to recognize spoken words accurately (Malik et al., 2021). This process underlines the importance of customizing ASR systems to the specific context of L2 learning, taking into account the unique linguistic characteristics and challenges faced by language learners. As the technology and understanding of ASR's applications in language learning have evolved, there has been a shift towards developing more learner-centric ASR systems that are better suited to the needs and nuances of L2 speech. The efficiency of ASR technology in assessment processes has enabled more precise scoring and reduced the need for manual transcription, saving time and resources for both teachers and students. In conclusion, ASR technology has demonstrated its potential in fostering interactive and personalized learning experiences. It has transformed traditional language instruction into a more self-directed and guided learning process. While addressing existing challenges remains crucial, the overall benefits of ASR in educational settings support the premise that this technology can significantly aid students' progress, make pedagogical contributions, and facilitate enhanced student-teacher interactions.

CONCLUSION
This systematic literature review was conducted with the aim of exploring the gap in the application of Automatic Speech Recognition (ASR) technology in language assessment, particularly focusing on the balance between human and automated scoring based on their validity and the potential challenges or errors that may arise when using ASR as a tool for speaking assessment. The review revealed several key factors that highlight the benefits of using ASR in assessing English speaking skills. These include the enhancement of students' progress, increased interaction, significant pedagogical contributions, and improved performance accuracy. One of the critical findings is the role of pedagogical contributions in validating the scores obtained from ASR. These contributions are seen in the effective collaboration between human and automated scoring, ensuring a more comprehensive and accurate assessment of students' speaking abilities.
An important observation from the review is that significant problems did not emerge during the application of ASR in classroom settings. This indicates that, when implemented effectively, ASR technology can be a valuable tool in assessing speaking abilities without major disruptions or issues. This aspect is particularly beneficial for teachers and students in secondary school English as a Foreign Language (EFL) classes, where the need for accurate and efficient assessment of speaking skills is paramount. Overall, the findings from this systematic literature review provide insightful implications for the use of ASR technology in language learning and assessment. They suggest that ASR, when used thoughtfully and in conjunction with traditional pedagogical methods, can significantly enhance the language learning process, particularly in speaking assessments. This integration of ASR in EFL classrooms can offer teachers and students a more dynamic and effective approach to language assessment, contributing to improved learning outcomes and language proficiency.