When AI Does Not Speak Our Language: Technical and Sociolinguistic Challenges in Automatic Speech Recognition of Algerian Arabic (original French title: Quand l’IA ne parle pas notre langue : enjeux techniques et sociolinguistiques de la reconnaissance automatique de l’arabe algérien)
Abstract
This article investigates the technical and sociolinguistic challenges of integrating Algerian Arabic into Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) systems, particularly in contexts involving code-switching with French. Significant dialectal variation, the absence of orthographic standardization, and the overrepresentation of Modern Standard Arabic in training corpora limit these systems’ ability to reflect our linguistic reality accurately. Addressing this situation means tackling technical issues (phonetic, prosodic, and script diversity) while also meeting an imperative of inclusion, so that dialectal varieties are not rendered invisible. To this end, we adopt a theoretical framework combining computational linguistics and digital sociolinguistics, and we propose a methodology based on the collaborative construction and annotation of representative corpora that incorporate regional variants, code-switching, and script diversity, with the aim of developing fairer models adapted to the Algerian context.
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.