Speech recognition based on Spanish accent acoustic model





Automatic Speech Recognition, Language Model, CMUSphinx


This article presents an Automatic Speech Recognition (ASR) model that converts human speech to text, a task considered one of the branches of artificial intelligence. Voice analysis yields information about the acoustics, phonetics, syntax, and semantics of words, among other elements, and can expose characteristics of a language such as ambiguous terms, pronunciation errors, and utterances with similar syntax but different semantics. The model focuses on the acoustic analysis of words and proposes a methodology for acoustic recognition based on transcripts of audio recordings containing human speech; the word error rate (WER) was used to measure the accuracy of the model. The recordings are emergency calls registered by the Integrated Security Service ECU911. The model was trained offline, without an internet connection, using the CMUSphinx tool for the Spanish language. The results showed that the word error rate varies with the number of recordings: the more recordings used, the fewer erroneous words and the higher the accuracy of the model. The study concludes by highlighting the duration of each recording as a variable that affects the accuracy of the model.
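The abstract uses the word error rate as its accuracy measure but does not spell out the computation. As a minimal sketch, assuming the standard definition WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output and N is the number of words in the reference, the metric could be computed as follows. The function names and the sample sentences are hypothetical illustrations, not taken from the article or from CMUSphinx.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] is the distance between the reference words processed so far
    # (one row back) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    # Assumption: simple lowercase, whitespace tokenization is adequate.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical transcripts, invented for illustration only.
    reference = "necesito una ambulancia en la avenida principal"
    hypothesis = "necesito una ambulancia en avenida principal"
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Because the denominator is the reference length, a single wrong word weighs more heavily in a short transcript than in a long one, so WER can exceed 1.0 when the recognizer inserts many extra words.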








How to Cite

Plaza Salto, J. G., Cristina, S.-Z., Acosta Urigüen, M. I., Orellana Cordero, M. P., Cedillo Orellana, I. P., & Zambrano-Martínez, J. L. (2022). Speech recognition based on Spanish accent acoustic model. Enfoque UTE, 13(3). https://doi.org/10.29019/enfoqueute.839