Automatic Identification of Arabic Dialects
2010, Language Resources and Evaluation
https://doi.org/10.13016/M2BK16T26…
5 pages
Abstract
In this work, automatic recognition of Arabic dialects is proposed. An acoustic survey of the proportion of vocalic intervals and the standard deviation of consonantal intervals in nine dialects (Tunisia, Morocco, Algeria, Egypt, Syria, Lebanon, Yemen, the Gulf countries and Iraq) is performed using the Alize platform and Gaussian Mixture Models (GMM). The results show the complexity of the
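The two rhythm measures named in the abstract, the proportion of vocalic intervals and the standard deviation of consonantal intervals (often written %V and ΔC in the rhythm literature), can be sketched as follows. This is a minimal illustration, not the authors' implementation: the interval labels, the toy durations, and the choice of the population standard deviation are all assumptions.

```python
from statistics import pstdev


def rhythm_metrics(intervals):
    """Return (%V, deltaC) for a list of (label, duration_seconds) pairs.

    label is "V" for a vocalic interval and "C" for a consonantal one.
    The labelling scheme is an assumption made for this sketch.
    """
    vocalic = [d for label, d in intervals if label == "V"]
    consonantal = [d for label, d in intervals if label == "C"]
    total = sum(vocalic) + sum(consonantal)
    if total == 0 or not consonantal:
        raise ValueError("need non-zero duration and at least one consonantal interval")
    percent_v = 100.0 * sum(vocalic) / total  # share of the utterance that is vocalic
    delta_c = pstdev(consonantal)             # population std; the source does not specify
    return percent_v, delta_c


# Hypothetical segmentation of one short utterance (durations in seconds).
utterance = [("C", 0.08), ("V", 0.12), ("C", 0.10), ("V", 0.08), ("C", 0.12)]
percent_v, delta_c = rhythm_metrics(utterance)
```

In a pipeline like the one described, such per-utterance values would presumably be the features passed to the GMM stage; that connection is not detailed in the visible part of the abstract.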
Related papers
2020
Arabic is the fourth most used language on the Internet and the official language of more than 20 countries around the world. It has three main varieties: Modern Standard Arabic, which is used in books, news and education; local dialects, which vary from one region to another; and Classical Arabic, the written language of the Quran. The Maghrebi dialect is the Arabic dialect used in North African countries, where internet users feel more comfortable using local slang than native Arabic. In this study, we present a large dataset of regional dialects from three countries, namely Algeria, Tunisia, and Morocco, and then investigate the identification of each dialect using machine learning classifiers with TF-IDF features. The approach shows promising results, achieving accuracy of up to 96%.
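To make the TF-IDF idea concrete, here is a small self-contained sketch. It is not the study's pipeline: the smoothed IDF formula is one common variant, the nearest-centroid cosine classifier stands in for the unspecified "machine learning classifiers", and the tokens and labels are made-up placeholders rather than real dialect data.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed inverse document frequency for every token seen in docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def tfidf(doc, idf):
    """L2-normalised TF-IDF vector (as a dict) for one tokenised document."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def train_centroids(docs, labels, idf):
    """Sum the TF-IDF vectors of each class into one centroid per label."""
    centroids = {}
    for doc, label in zip(docs, labels):
        centroid = centroids.setdefault(label, Counter())
        for tok, v in tfidf(doc, idf).items():
            centroid[tok] += v
    return centroids


def predict(doc, centroids, idf):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(doc, idf)

    def cosine(label):
        c = centroids[label]
        norm = math.sqrt(sum(v * v for v in c.values()))
        dot = sum(v * c.get(tok, 0.0) for tok, v in vec.items())
        return dot / norm if norm else 0.0

    return max(centroids, key=cosine)


# Placeholder tokens: "common" appears everywhere, the others are class-specific.
train_docs = [["aa", "bb", "common"], ["aa", "cc", "common"],
              ["dd", "ee", "common"], ["dd", "ff", "common"]]
train_labels = ["dialect_A", "dialect_A", "dialect_B", "dialect_B"]
idf = fit_idf(train_docs)
centroids = train_centroids(train_docs, train_labels, idf)
```

Tokens that occur in every document, such as the placeholder "common", receive the lowest IDF weight, so class-specific vocabulary dominates the similarity score.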
Interspeech 2009
While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life; identifying a speaker's dialect is thus critical to speech processing tasks such as automatic speech recognition, as well as speaker identification. We examine the role of prosodic features (intonation and rhythm) across four Arabic dialects (Gulf, Iraqi, Levantine, and Egyptian) for the purpose of automatic dialect identification. We show that prosodic features can significantly improve identification over a purely phonotactic-based approach, with an identification accuracy of 86.33% for 2-minute utterances.
The Egyptian Journal of Language Engineering, 2018
In traditional Dialect Identification (DID) approaches, regardless of the level and type of features used for identification, systems either use predefined references such as phones, phonemes, or even acoustic sounds that characterize a language/dialect, or involve some sort of transcription of the input data. The transcription may be manual or automatic, using tools such as ASRs, tokenizers, or phone recognizers. In this paper, we introduce a new approach based on analyzing the speech signal directly and extracting the features that characterize the dialect without any predefined references and without any sort of transcription. The main idea is to find the repeated sequences (motifs) of the dialect by treating the speech signal as a time series, so that motif discovery techniques can extract the repeated sequences directly from the speech signal. For motif extraction, we adopted an extremely fast, parameter-free self-join motif discovery algorithm called Scalable Time series Ordered-search Matrix Profile (STOMP). We implemented the new approach in two stages: in the first, we built a baseline system in which we extracted 12 Mel Frequency Cepstral Coefficients (MFCC) from each motif; in the second, we built an improved system using 39 coefficients by adding 13 Delta coefficients, 13 Delta-Delta coefficients, and 1 Log Energy coefficient. In both systems, we used a Gaussian Mixture Model-Universal Background Model (GMM-UBM) as the classifier. We applied the new approach to three different motif lengths (500 ms, 1000 ms, and 1500 ms), using from 1 up to 2048 GMM components. We downloaded the data set from the Qatar Computing Research Institute domain. We carried out our experiments on different Arabic dialects: Egyptian (EGY), Gulf (GLF), Levantine (LEV), and North African (NOR). The baseline results were very competitive with the traditional, more sophisticated approaches, while the improved system showed very good results. The improvement was so significant that the new approach can be considered competitive, simple, and dialect-independent.
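The self-join motif search described above can be illustrated with a brute-force matrix profile. This sketch is deliberately not STOMP: it computes the same kind of nearest-neighbour profile in quadratic time without STOMP's incremental updates, and the toy series, window length, and exclusion-zone width are assumptions chosen for illustration.

```python
import math


def znorm(window):
    """Z-normalise one window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def naive_matrix_profile(series, m):
    """Brute-force self-join: nearest-neighbour distance for every window."""
    n = len(series) - m + 1
    windows = [znorm(series[i:i + m]) for i in range(n)]
    exclusion = max(1, m // 2)  # assumed width; skips trivial self-matches
    profile = [math.inf] * n
    index = [-1] * n
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= exclusion:
                continue
            d = math.dist(windows[i], windows[j])
            if d < profile[i]:
                profile[i], index[i] = d, j
    return profile, index


def top_motif(series, m):
    """Return (start_a, start_b, distance) of the closest pair of windows."""
    profile, index = naive_matrix_profile(series, m)
    i = min(range(len(profile)), key=profile.__getitem__)
    return i, index[i], profile[i]


# Toy series with the shape [0, 1, 3, 1, 0] planted at offsets 2 and 12.
series = [5, 2, 0, 1, 3, 1, 0, 7, 4, 9, 6, 8, 0, 1, 3, 1, 0, 2, 9, 3]
start_a, start_b, distance = top_motif(series, 5)
```

On real speech the series would be waveform samples and the window length would correspond to the motif durations mentioned in the abstract, at which point the quadratic cost of this naive version is what makes a fast algorithm necessary.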
2011
We study the effectiveness of recently developed language recognition techniques based on speech recognition models for the discrimination of Arabic dialects. Specifically, we investigate dialect-specific and cross-dialectal phonotactic models, using both language models and support vector machines (SVMs). Techniques are evaluated both alone and in combination with a cepstral system with joint factor analysis (JFA), using a four-dialect data set employing 30-second telephone speech samples. We find good complementarity from different features and modeling paradigms, and achieve 2% average equal error rate for pairwise classification.
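For readers unfamiliar with the equal error rate (EER) reported here, the sketch below approximates it for a single pair of classes by sweeping a decision threshold over the observed scores. The scores are invented for illustration, and averaging over all dialect pairs, as the abstract describes, is left out.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by testing every observed score as a threshold.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where the miss rate and false-alarm rate are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Hypothetical detector scores for one dialect pair.
target = [0.9, 0.8, 0.7, 0.4]     # trials that truly belong to the dialect
nontarget = [0.6, 0.3, 0.2, 0.1]  # trials from the other dialect
eer = equal_error_rate(target, nontarget)
```

With few trials the two error rates may never be exactly equal, which is why the sketch reports their mean at the closest point rather than interpolating.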
2020
Dialect IDentification (DID) is a challenging task, and it becomes more complicated when the dialects to be identified belong to the same country. Indeed, dialects of the same country are closely related and exhibit significant overlap at the phonetic and lexical levels. In this paper, we present our first results on a dialect classification task covering four sub-dialects spoken in Tunisia. We use the term ‘sub-dialect’ to refer to dialects belonging to the same country. Our experiments aim to discriminate between Tunisian sub-dialects from four different cities, namely Tunis, Sfax, Sousse and Tataouine. A spoken corpus of 1673 utterances was collected, transcribed and freely distributed. We used this corpus to build several speech- and text-based DID systems. Our results confirm that, at this level of granularity, dialects are much more easily distinguished using the speech modality. Indeed, we were able to reach an F-1 score of 93.7...
Interspeech 2016, 2016
In this paper, we investigate different approaches for dialect identification in Arabic broadcast speech. These methods are based on phonetic and lexical features obtained from a speech recognition system, and bottleneck features using the i-vector framework. We studied both generative and discriminative classifiers, and we combined these features using a multi-class Support Vector Machine (SVM). We validated our results on an Arabic/English language identification task, with an accuracy of 100%. We also evaluated these features in a binary classifier to discriminate between Modern Standard Arabic (MSA) and Dialectal Arabic, with an accuracy of 100%. We further report results using the proposed methods to discriminate between the five most widely used dialects of Arabic, namely Egyptian, Gulf, Levantine, North African, and MSA, with an accuracy of 59.2%. We discuss dialect identification errors in the context of dialect code-switching between Dialectal Arabic and MSA, and compare the error pattern between manually labeled data and the output from our classifier. All the data used in our experiments have been released to the public as a language identification corpus.
2012
Although Arabic is the world's second most spoken language in terms of the number of speakers, Arabic automatic speech recognition (AASR) has not received the desired attention from the research community. In this paper, we introduce a thorough statistical analysis of the Arabic phonemes from a widely used Arabic corpus that was developed by King Fahd
Language Resources and Evaluation, 2016
The Algerian linguistic situation is very intricate due to ethnic, geographical and colonial influences, which have led to a complex sociolinguistic environment. As a result of the contact between different languages and accents, the Algerian speech community has acquired a distinctive sociolinguistic situation. In addition to the intra- and inter-lingual variations describing the day-to-day linguistic behavior of Algerian speakers, their speech is characterized by many linguistic phenomena such as bilingualism and code switching. The study of automatic regional accent recognition in this type of environment is a new idea in the field of automatic language, dialect and accent recognition, especially since previous studies were conducted using monolingual evaluation data. The effectiveness of the GMM-UBM and i-vector frameworks for accent recognition is assessed using the Algerian Modern Colloquial Arabic Speech Corpus (AMCASC), a linguistic resource collected for this purpose. The assessment shows that these acoustic approaches are degraded not only by mismatches in recording conditions, channels and recording lengths and by amplitude clipping, but also by language contact phenomena, which should be taken into consideration, especially in real-life applications.
The recognition of continuous speech is one of the main challenges in building automatic speech recognition (ASR) systems, especially for phonetically complex languages such as Arabic. ASR systems seem to have reached a dead end: nearly all solutions follow the same general model, and previous research has focused on enhancing its performance by incorporating supplementary features. This paper is part of ongoing research efforts aimed at developing a high-performance Arabic speech recognition system for learning and teaching purposes. It presents a statistical analysis of certain distinctive features of the basic Arabic phonemes, which seem helpful in enhancing the performance of a baseline HMM-based ASR system. The statistics are collected using a particular Arabic speech database, which involves ten different male speakers and more than eight hours of speech covering all Arabic phonemes. In the HMM modeling framework, the statistics provided help establish the appropriate number of HMM states for each phoneme, and they can also be used as an initial condition for the EM estimation procedure, which generally accelerates the estimation process and thus improves the performance of the system. The findings are presented, and possible applications in automatic speech recognition and speaker identification systems are also suggested.
WSEAS TRANSACTIONS on COMPUTERS, 2022
The Arabic language has many different dialects, and the dialect must be recognized before automatic speech recognition (ASR) is applied. In all Arab countries, Standard Arabic is widely written and used in official speech, newspapers, public administration, and schools, but it is not used in daily conversation; instead, the dialect is widely spoken in daily life and rarely written. In this paper, we examine the difficult task of properly identifying various Arabic dialects and propose a system that identifies four regional dialects and Modern Standard Arabic, based on speech recognition using Hidden Markov Models (HMMs). HMMs have become a very popular way to build speech recognition systems. An HMM is defined by a set of hidden states and the probabilities of transition from one state to another. Because of the similarities and differences between the Arabic dialects, we used speech from the ADI5 dataset, retrieved from the MGB-3 challenge. The proposed system, called "Building a System for Arabic Dialects Identification based on Speech Recognition using Hidden Markov Models (HMMs)", takes speech utterances as input and outputs the dialect being spoken. During the training phase, speech utterances from one or more dialects were analyzed to capture the important properties of the audio signals in terms of time and frequency. During the testing phase, previously unseen test utterances were presented to the system, which outputs the dialect whose model most closely matches the test utterance. The proposed system shows promising results for each dialect.
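The decision rule described above, picking the dialect whose model best matches the utterance, can be sketched with the scaled forward algorithm. This is an illustrative simplification rather than the paper's system: it uses discrete observation symbols instead of continuous acoustic features, and the two toy models and their parameters are invented.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_prob += math.log(scale)
    return log_prob


def identify(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical two-state models over a two-symbol alphabet {0, 1}.
sticky = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "dialect_A": ([0.5, 0.5], sticky, [[0.8, 0.2], [0.6, 0.4]]),  # favours symbol 0
    "dialect_B": ([0.5, 0.5], sticky, [[0.2, 0.8], [0.4, 0.6]]),  # favours symbol 1
}
guess = identify([0, 0, 1, 0, 0], models)
```

In a trained system the parameters would be estimated per dialect from the training utterances; here they are fixed by hand so that the selection step can be read in isolation.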