Helpful Statistics in Recognizing Basic Arabic Phonemes

Mohamed O.M. Khelifa


Abstract

The recognition of continuous speech is one of the main challenges in building automatic speech recognition (ASR) systems, especially for phonetically complex languages such as Arabic. ASR research currently appears to be at an impasse: nearly all solutions follow the same general model, and previous research has focused on enhancing its performance by incorporating supplementary features. This paper is part of ongoing research efforts aimed at developing a high-performance Arabic speech recognition system for learning and teaching purposes. It presents a statistical analysis of certain distinctive features of the basic Arabic phonemes that appear helpful in enhancing the performance of a baseline HMM-based ASR system. The statistics are collected from a particular Arabic speech database, which involves ten different male speakers and more than eight hours of speech covering all Arabic phonemes. In the HMM modeling framework, the statistics are helpful in establishing the appropriate number of HMM states for each phoneme, and they can also be used as an initial condition for the EM estimation procedure, which generally accelerates the estimation process and thus improves the performance of the system. The findings are presented, and possible applications in automatic speech recognition and speaker identification systems are suggested.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 2, 2017

Mohamed O.M. Khelifa
TES Research Team, ENSIAS School of Engineering, Mohammed V University in Rabat, Rabat, Morocco

Yahya O.M. ElHadj
Doha Institute, Doha, Qatar; SAMoVA Research Team, IRIT, Paul Sabatier University, Toulouse, France

Yousfi Abdellah
FSJES-Souissi, Mohammed V University in Rabat, Rabat, Morocco

Mostafa Belkasmi
TES Research Team, ENSIAS School of Engineering, Mohammed V University in Rabat, Rabat, Morocco

Keywords: automatic speech recognition (ASR); speech recognizer; phoneme recognition; speech database; hidden Markov models (HMMs)

I. INTRODUCTION

The most common way for humans to communicate is through the sounds produced during speech. Thoughts and ideas are exchanged via speech: one person speaks and the other receives the message by ear. Automatic speech recognition (ASR) is the process by which a computer recognizes and acts upon spoken language or utterances using particular algorithms [1-5]. It is a branch of artificial intelligence (AI) and is related to various areas of knowledge, including informatics, linguistics, acoustics, and pattern recognition. An ordinary ASR system consists of a microphone unit, a speech recognition engine, a computer, and some form of audio, visual, or action output. The applications of an ASR system can be classified into two main areas: dictation and human-computer dialogue. In the dictation area, broadcast news dictation technology has been incorporated into information extraction and retrieval technology and into many application systems, such as retrieval systems and automatic voice document indexing. In the human-computer interaction area, a variety of experimental systems for information retrieval through spoken dialogue have been investigated.

A common ASR application is the automated conversion of speech into written text, which can increase productivity and enhance access to diverse computer applications such as word processing, email, remote control, telephone use, language identification, speaker identification, archiving, and language acquisition.

By using speech as input, ASR applications reduce the need for more traditional manual input techniques via keyboards and mice, which makes ASR helpful as an alternative input technique for people with disabilities. ASR performance may be affected by various factors, including the quality of the input speech, the technology design, the surrounding environment, and speaker characteristics.

In spite of the remarkable advances in signal processing, computational architectures, algorithms, and hardware, ASR is still a topic of active research, and ideal systems are still far from being reached [6]. Thus, the most important research issues should be addressed in order to advance toward the ultimate goal of fluent speech recognition.

In speech recognition, recognizing isolated words is relatively simple; the main challenge is to recognize continuous speech. Any ASR system has two parts: the language model and the acoustic model. The language model indicates the status of the word sequences to be recognized: are they common or rare? The acoustic model, in turn, is used to model the sounds we produce when we speak. For a small vocabulary, it is easy to model the acoustics of individual words. As the vocabulary size grows, it becomes impractical to record sufficient spoken examples of all words, and so acoustics must be modeled at a lower level. State-of-the-art ASR systems do not rely on whole words in either training or decoding, because of the enormous number of words that may exist in a speech corpus and the need for sufficient spoken examples of each word. Instead, a successful ASR system uses smaller parts of words, or sub-word units, that are commonly designed by phoneticians or experts in linguistics. This set of sub-word units is referred to as phonemes.

Most of the current successful ASR systems are based on hidden Markov models (HMMs), in which each phoneme is modeled by a set of HMM states. A left-to-right HMM topology with three emitting states is commonly used for each phoneme, independent of its length. The question that arises is whether this number of states is sufficient for certain phonemes, or whether it is greater or fewer than what is needed. One of the main issues in an ASR system is to determine the number of HMM states that reflects the actual length of each phoneme occurrence in a speech corpus.

Despite the sizable use of speech recognition technologies for foreign languages such as English and French, Arabic suffers from a rarity of mature ASR-based applications, especially for language teaching and learning. One notable application of Arabic speech recognition is the teaching of the Classical Arabic (CA) sound system. Although Classical Arabic is not used in everyday communication, it is required for learning the Holy Quran (the Muslim holy book) and the old Arabic poetry heritage. Moreover, it can open the door to various sorts of Islamic applications.

The present paper is part of ongoing research efforts aiming to develop a high-performance Arabic speech recognition system for learning and teaching purposes. The first stages of these efforts were dedicated to the development of a particular Arabic speech database including ten different speakers and more than eight hours of speech collected from recitations of the Holy Quran, in which all Arabic phonemes are included. The speech signals of this database were manually and accurately segmented and labeled on three levels: word, phoneme, and allophone. Next, two baseline HMM-based recognizers were built to validate the speech segmentation on both the phoneme and allophone levels and to examine the intended recognition accuracy of both recognizers.

The current stage investigates a statistical analysis of certain distinctive features of Arabic phonemes, in order to incorporate them later into the speech recognition process with the aim of improving the performance of our baseline HMM-based recognizers. The distinctive features investigated in this work are phoneme durations (their mean, median, and mode for each basic phoneme) and the frequency and probability of occurrence of each basic phoneme. Analysis and interpretation were performed to determine which of these distinctive features can significantly enhance system performance. In the HMM modeling framework, the statistics provided can be helpful in establishing the appropriate number of HMM states for each phoneme, which generally increases speed and recognition accuracy. The phoneme statistics can also be used as an initial condition for the Expectation-Maximization estimation procedure, and hence accelerate the estimation process, or they can be used as a target model themselves. Also, the probability of two neighboring phoneme clusters is helpful information that is not yet integrated into the adjustment of the speech characteristics of possible words from a dictionary.

The rest of the article is organized as follows. Section II summarizes our research efforts toward the ultimate goal. Section III describes the motivation of the presented work. Section IV gives a brief overview of the previously developed speech database. Section V presents the methodology used for statistics extraction. Section VI gives the details of the statistical analysis. Finally, Section VII concludes the paper.

II. RESEARCH EFFORTS SUMMARY

As findings of a previously funded research project [7], two baseline HMM-based systems for phonemes and allophones [8, 9] were constructed using the mentioned speech database. The number of allophones in the speech database is 110, plus a silence unit that is counted as a normal allophone and indicates short pauses during the recitations, while the number of phonemes is 60, which represents almost half of the number of allophones. At both levels, all speech units were modeled by an HMM with three emitting states to capture their acoustic properties. Each state was associated with a Gaussian mixture model (GMM) to describe the characteristics of the sound portion at that state. Mel-frequency cepstral coefficients (MFCCs) were used as cepstral acoustic features. For each Hamming window of 10 ms, a vector of 39 coefficients was extracted. These coefficients are the first twelve MFCCs plus their first and second derivatives, to capture the sound's static features at this portion. In addition, the energy plus its first and second derivatives were appended to identify the sound's dynamic features at the same portion. The hidden Markov model toolkit (HTK) was employed to train and test the HMMs for both systems. The word error rates (WERs) obtained for these recognizers were 8% and 12% for phonemes and allophones, respectively.

Our current efforts focus on the development of a more elaborate system, by first considering the basic sounds and then looking for their distinctive features to determine which ones will be particularly helpful in identifying their phonological variation. To this end, we have adapted the speech database so that it is annotated in terms of basic phonemes. By basic phonemes we mean the basic sounds without any phonological variation and without considering the gemination (doubling) of sounds. There are 32 of them. Their list and their associated codes are shown in Table II.

The new version of the speech database has been used in all the efforts accomplished so far, including an HMM-based recognizer for basic Arabic sounds [10], an enhanced Arabic phoneme recognizer using duration modeling techniques [11], and an accurate HSMM-based system for Arabic phoneme recognition [12]. In the last implemented system for the basic Arabic phonemes [12], the average recognition rates obtained are about 99%.

III. BACKGROUND AND MOTIVATION

Automatic speech recognition (ASR) currently appears to be at an impasse. Nearly all solutions follow the same general model [13], and research has focused on enhancing its performance by integrating supplementary elements. Such an approach has yielded better results, but it must be admitted that there is a limit that cannot be exceeded without modifying the general scheme. The method based on hidden Markov models (HMMs) with features of fixed frame length has found its utility in numerous applications. However, it does not seem to be effective enough to properly transcribe any spoken language with a large vocabulary. There are several reasons, some of which are very straightforward in nature. A dictionary-based ASR system will never work correctly for out-of-dictionary words. Grammar models will not deal correctly with incorrectly spoken utterances, while humans very often can.

An ASR system tries to recognize speech via these matching techniques, while humans can easily understand it and adapt to mistakes and unusual words. This causes the mentioned limit of the classical ASR approaches. The standard ASR approach is, indeed, based on guesswork and luck in a few steps of its procedure. The input speech is segmented into frames without any motivated rules. The HMM attempts to find the closest transcription on the basis of speech features, which is, indeed, a kind of guessing. Such an approach works well enough for plainly spoken words with a limited vocabulary. Noise, the speaking rate, and a large vocabulary cause many exceptions and much missing data, which the HMM cannot deal with correctly. Another major problem is that people do not speak as carefully as they write, while we expect a transcription produced by an ASR system to be of the quality of our typed texts.

It also has to be admitted, by both ordinary users and researchers, that when we speak we do not always follow grammar rules and, furthermore, that mistakes in pronunciation introduce various exceptions independently of the dictionary size used. This is why adopting a hypothesis based on related language rules and a limited dictionary does not always work satisfactorily. The same issues arise in the case of names, out-of-language words, mispronounced phonemes, etc. An ASR system attempts to fit the input speech to the language rules and the static vocabulary, which, in certain cases, leads to supplementary distortions and hence to degradation in system performance.

There is no straightforward solution to the problems described above. In this work, we suggest using phoneme statistics collected in a target language as, for instance, a support for the dictionary when there is difficulty in associating matching features with one of the words to be recognized in the vocabulary.

The most outstanding research work carried out on continuous speech is based on statistical approaches, specifically hidden Markov models (HMMs). Many HMM-based ASR systems for continuous Arabic speech have reached various levels of recognition accuracy, and encouraging performance has been achieved [14-18]. Recognition accuracy is usually measured by the percentage of correctly recognized phonemes. The performance of HMM-based ASR systems is affected by various factors, including the existence of noise, the number of HMM states associated with each phoneme, the phoneme combination used, and the phoneme length. Enhancing the performance of present ASR techniques requires examining these factors in order to localize and recognize the regions of possible enhancement.

Nonetheless, no full statistical analysis at the phoneme level has been carried out on the speech database of Classical Arabic sounds used in this work. Statistical analysis of Arabic phonemes gives a comprehensible view of phoneme behavior and provides the capability to regulate this behavior by investigating the gathered statistics. For example, the frequency of a specific phoneme in a speech database can be employed to correct its misrecognition during the decoding process. This means replacing the misrecognized phoneme with the most probable one.

Furthermore, the average duration of a particular phoneme can be used to estimate the number of HMM states that is most appropriate for recognizing it. Additional statistical information such as the mode (the most frequent value in a set of values) and the median (the middle value in a set of values) is advantageous in addressing misrecognized phonemes during the decoding process. In this paper, we present a full statistical analysis of Arabic phonemes, which can be employed to enhance the performance of our baseline HMM-based systems by reducing the word error rate (WER).

IV. SPEECH DATABASE OF SOUNDS

The Arabic language is the official language of about 300 million speakers around the world. It is the religious language of all Muslims around the world, regardless of their native language. It is the official language in all Arab countries and the 6th most widely used language in terms of first-language speakers. Arabic can be categorized into two main variants: Classical Arabic (CA) and Modern Standard Arabic (MSA). CA is an old literary form of Arabic; it is the most formal type and is the language of the Holy Quran and old Arabic poetry. MSA is the current standard form of Arabic, which is used in official communications in Arab countries, broadcast news, formal speeches, etc. Although there is no big difference between today's Arabic (MSA) and that spoken by the early Arabs (CA), because Arabic is one of the most stable languages throughout history, there are some idiosyncrasies in the way of pronunciation.

One of the main barriers faced in the development of ASR applications for Arabic speech is the rarity of suitable sound databases, which are commonly required for training and testing statistical models. This problem is particularly serious when dealing with Classical Arabic, since most of the corpora available nowadays are specifically oriented toward Modern Standard Arabic (MSA) and its sub-forms (i.e., dialects). To remedy this problem and to assist the development of ASR applications for Classical Arabic, a speech database covering all Classical Arabic sounds was designed on the basis of Quranic recitations. The speech corpus was developed in a previously funded project by Al-Imam Muhammad ibn Saud Islamic University in Saudi Arabia with the support of King Abed Al-Aziz City for Science and Technology (KACST). Because of the difficulty of developing this kind of corpus, only a part of the Holy Quran was considered. Recitations of ten male speakers were recorded in an appropriate environment under the supervision of an expert in the Holy Quran pronunciation rules (called Tajweed); more than eight hours of speech were obtained [19-21]. Each audio file is a Quranic verse, or a portion of one for long verses where the speaker must take a long breath.

In order to have a speech database useful for many goals, the speech signals were manually and accurately segmented on three levels: word, phoneme, and allophone. A new labeling system was proposed to annotate the speech segments [16], because the labeling systems available (e.g., IPA, SAMPA, BEEP, etc.) were not able to cover all Arabic sounds. The speech database consists of 44.1 kHz WAV files of 16 millisecond utterances, together with their corresponding MFCC feature files, label files, and TextGrid files.

Table I lists, for each speaker, the number of sound files, their size, and their duration. The list of basic Arabic phonemes and their associated codes is shown in Table II.

TABLE I. SOUND FILES AND THEIR DURATION BY SPEAKERS

Speaker Number   Speaker Initials   Number of Sound Files   Duration (minutes)   Size (MB)
1                AAH                600                     49.36                249
2                AAS                590                     52.09                261
3                AMS                612                     45.78                229
4                ANS                597                     49.72                250
5                BAN                585                     54.75                276
6                FFA                578                     44.11                220
7                HSS                601                     49.76                251
8                MAS                580                     46.24                232
9                MAZ                608                     51.47                258
10               SKG                584                     44.29                220
Total                               5935                    487.53 (8h, 8m)      2446

TABLE II. LIST OF BASIC ARABIC PHONEMES AND THEIR CODES

Arabic Orthography   Arabic Name   Label
ـَـ                  فتحة          as10
ـُـ                  ضمة           us10
ـِـ                  كسرة          is10
ء                    همزة          hz10
ب                    باء           bs10
ت                    تاء           ts10
ث                    ثاء           vs10
ج                    جيم           jb10
ح                    حاء           hb10
خ                    خاء           xs10
د                    دال           ds10
ذ                    ذال           vb10
ر                    راء           rs10
ز                    زاء           zs10
س                    سين           ss10
ش                    شين           js10
ص                    صاد           sb10
ض                    ضاد           db10
ط                    طاء           tb10
ظ                    ظاء           zb10
ع                    عين           cs10
غ                    غين           gs10
ف                    فاء           fs10
ق                    قاف           qs10
ك                    كاف           ks10
ل                    لام           ls10
م                    ميم           ms10
ن                    نون           ns10
هـ                   هاء           hs10
و                    واو           ws10
ي                    ياء           ys10
(silence)            صامت          sil

In addition, the speech database contains a list of 60 Arabic phonemes, an Arabic dictionary, a list of all distinct words included in the whole eight-hour speech database, and other useful files needed for recognizer development.

V. STATISTICS EXTRACTION METHOD

To extract statistics from the speech database, a computer program was written in the MATLAB programming language developed by MathWorks [22]. For each basic phoneme, the program calculates the occurrence probability, the frequency of occurrence, the mean duration, the minimum and maximum durations, and the mode and median of the duration. Durations are computed from the phoneme boundaries extracted from the TextGrid files that accompany the speech database.

The gathered statistics are displayed in Table III, which also shows the labels used for every basic phoneme in the speech database. Fig. 1 shows the mean duration of the basic phonemes, measured in seconds. The frequency of each basic phoneme in the whole database is shown in Fig. 2. For an in-depth analysis of the collected statistics, and to provide extra information about the characteristics of the basic Arabic phonemes, further graphs are given in Figures 3, 4, 5, and 6.

Fig. 1. Mean Duration of the Basic Arabic Phonemes (x-axis: basic Arabic phonemes; y-axis: mean duration in seconds)

Fig. 2. Basic Arabic Phonemes Frequencies (x-axis: basic Arabic phoneme labels; y-axis: frequency)
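The authors' extraction program was written in MATLAB and is not reproduced in the paper. The following Python sketch only illustrates the kind of computation Section V describes, under stated assumptions: the phoneme boundaries are assumed to have been read from the TextGrid files already and are passed in as (label, start, end) tuples in seconds. The function name, the tuple layout, and the 10 ms resolution used to bin durations before taking the mode are all hypothetical choices made for illustration, not details taken from the paper.

```python
from collections import defaultdict
from statistics import mean, median, multimode


def phoneme_duration_stats(intervals, resolution=0.01):
    """Per-phoneme occurrence and duration statistics.

    intervals  -- iterable of (label, start_sec, end_sec) tuples (assumed layout)
    resolution -- bin width in seconds used for the mode (assumed value: 10 ms)
    """
    durations = defaultdict(list)
    for label, start, end in intervals:
        if end <= start:
            raise ValueError(f"non-positive duration for {label!r}")
        durations[label].append(end - start)

    total = sum(len(d) for d in durations.values())
    stats = {}
    for label, d in durations.items():
        # Durations are continuous, so they are quantised to `resolution`
        # before counting; ties are resolved toward the shortest bin.
        bins = [round(x / resolution) for x in d]
        mode_bin = min(multimode(bins))
        stats[label] = {
            "frequency": len(d),             # number of occurrences
            "probability": len(d) / total,   # relative frequency in the corpus
            "mean": mean(d),
            "min": min(d),
            "max": max(d),
            "median": median(d),
            "mode": mode_bin * resolution,
        }
    return stats


if __name__ == "__main__":
    # Toy segmentation, not data from the speech database.
    toy = [
        ("as10", 0.00, 0.10),
        ("ls10", 0.10, 0.16),
        ("as10", 0.16, 0.28),
        ("sil", 0.28, 0.58),
        ("as10", 0.58, 0.68),
    ]
    for label, s in sorted(phoneme_duration_stats(toy).items()):
        print(label, s["frequency"], round(s["probability"], 2), round(s["mean"], 3))
```

Binning before taking the mode is a design choice of this sketch: on raw floating-point durations almost every value would be unique, so the "most frequent duration" would be ill-defined without some quantisation.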
Fig. 3 shows the occurrence probability of the basic Arabic phonemes in the whole speech database. This graph will serve in defining the probability of missing phonemes during the decoding process. Note that the phoneme "sil", which denotes silence regardless of where it occurs in the speech database, is included in all the graphs.

Fig. 3. Basic Arabic Phonemes Occurrence Probability (x-axis: basic Arabic phonemes; y-axis: probability)

An interesting outcome that is apparent from Fig. 4 is that basic phonemes having equal or approximately equal mean values can be grouped into clusters. We assume that these clusters will be helpful for enhancing the performance of the baseline recognizer, as discussed in the next section. The medians of the basic phoneme durations give a clearer view of those clusters: the classes of phoneme groups are differentiated from each other, and the separation among phoneme groups becomes more obvious, as seen in Fig. 5.

Fig. 4. Sorted Basic Arabic Phonemes based on their Means (x-axis: basic Arabic phoneme labels; y-axis: mean duration in seconds)

Fig. 5. Sorted Basic Arabic Phonemes based on their Medians (x-axis: basic Arabic phonemes; y-axis: median duration in seconds)

Another significant graph is the one showing the most frequent duration value over all occurrences of a basic phoneme in the "CA Sound Database". This value is referred to as the mode and is displayed in Fig. 6.

Fig. 6. Sorted Basic Arabic Phonemes based on their Modes (x-axis: basic Arabic phonemes; y-axis: duration mode in seconds)

VI. STATISTICS ANALYSIS

Looking at the previous tables and graphs, we find that the basic phonemes occur with various frequencies. The most frequent ones are "as10" (فتحة), "is10" (كسرة), and "us10" (ضمة), respectively, which designate the Arabic vowels. The least frequent ones are "zb10" (حرف الظاء), "gs10" (حرف الغين), and "zs10" (حرف الزاء), respectively, ignoring the phoneme denoting silence, "sil" (صامت). From the results shown in Figures 2 and 3, it seems clear that when a phoneme is missed during the decoding process, the phoneme "as10" is the most probable one to replace it. Generally, the results drawn from Fig. 3 can be employed to correct the pronunciation of a misrecognized phoneme in spoken utterances during the recognition phase. The use of this information seems useful in enhancing the baseline system performance.

Fig. 4 shows all the basic Arabic phonemes sorted by their average durations. This figure clearly shows the behavior of the basic phoneme durations throughout the whole speech database. It gives an explicit idea of the average duration of each phoneme, from which clusters of basic phonemes can be distinguished. For example, the basic phonemes "hz10" and "rs10" form the first cluster. The second cluster includes "vb10", "fs10", and "hs10". The vowels form the last cluster, with the highest average durations. Usually, knowing the average length of a specific phoneme in a speech database can be used to estimate the appropriate number of HMM states to represent it, which generally shortens the estimation period and hence enhances the accuracy of recognition.

In Fig. 5 and Fig. 6, the median and mode durations of each basic phoneme are displayed, and the clusters of basic phonemes appear clearly. The outcomes of both figures could help in making the correct decision when dealing with either misrecognized or missed phonemes, that is, replacing them with the phoneme whose median or mode is nearest.

VII. CONCLUSION

In this paper, we have presented a collection of statistical data for the basic Arabic phonemes that is helpful in enhancing the performance of HMM-based automatic speech recognition systems. In the literature, the duration of phonemes is regarded as a major distinctive feature characterizing the voice of a speaker. Knowing the duration of a particular phoneme in spoken utterances can be used to estimate the length of the HMM chain describing it, which in consequence improves system performance. These investigations were performed using a particular speech database of Quranic sounds including more than eight hours of speech and ten different male speakers. The numerical values were extracted using a computer program designed for this purpose. A discussion of these results, with interpretations, was also presented and reported graphically. Dividing phonemes into clusters on the basis of the median of their durations can help narrow the search for the appropriate phoneme during the decoding process, which in consequence increases system performance. The collected statistics can also be used to build or propose other techniques for phoneme classification. While the probability distributions in HMM-based ASR systems are usually estimated with the iterative Expectation-Maximization algorithm, the statistics provided can be used as an initial condition for the estimation procedure, and thus speed up its execution, or can be used as a target model themselves.

We believe that the absence of numerical data describing, in particular, the behavior of the basic Arabic phonemes in Classical Arabic, like the data reported here, gives added value to the presented work. Our future steps will focus on incorporating these statistics explicitly into HMMs in order to overcome the weaknesses of classical HMMs and, hence, improve the performance of HMM-based systems.

ACKNOWLEDGMENT

The presented work uses the results (Classical Arabic Sound Database) of a project previously funded by King Abed Al-Aziz City for Science and Technology (KACST) in Saudi Arabia under grant number "AT – 25 – 113".

REFERENCES

[1] D. Jurafsky and J. H. Martin, Speech and Language Processing, 2nd ed., Pearson Prentice Hall, 2009.
[2] G. Zweig and P. Nguyen, "A segmental CRF approach to large vocabulary continuous speech recognition," Proc. of IEEE ASRU, 2009.
[3] H. Sakoe, "Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition," Readings in Speech Recognition, Morgan Kaufmann Publishers Inc., pp. 180-186, 1990.
[4] H. Jiang, "Discriminative training for automatic speech recognition: A survey," Computer Speech & Language, vol. 24, no. 4, pp. 589–608, 2010.
[5] L. Deng and X. Li, "Machine learning paradigms for speech recognition: An overview," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 1060–1089, May 2013.
[6] I. Oparin, Language Models for Automatic Speech Recognition of Inflectional Languages. PhD Thesis, University of West Bohemia, Plzen, Czech Republic, 2009.
[7] Y.O.M. Elhadj, I.A. Alsughayeir, M. Alghamdi, M. Alkanhal, Y.M. Ohali, A.M. Alansari, Computerized teaching of the Holy Quran (in Arabic), Final Technical Report, King Abdulaziz City for Sciences and Technology (KACST), Riyadh, KSA, 2012.
[8] Y.O.M. Elhadj, M. Alghamdi, and M. Alkanhal, "Phoneme-Based Recognizer to Assist Reading the Holy Quran," Recent Advances in Intelligent Informatics, Advances in Intelligent Systems and Computing, Springer, pp. 141-152, 2014.
[9] Y.O.M. Elhadj, M. Alghamdi, and M. Alkanhal, "Approach for Recognizing Allophonic Sounds of the Classical Arabic Based on Quran Recitations," Theory and Practice of Natural Computing, Lecture Notes in Computer Science, Springer, pp. 57-67, 2013.
[10] Y.O.M. Elhadj, Mohamed O.M. Khelifa, A. Yousfi and M. Belkasmi, "An Accurate Recognizer for Basic Arabic Sounds," ARPN Journal of Engineering and Applied Sciences, vol. 11, no. 5, pp. 3239-3243, Mar. 2016.
[11] Mohamed O.M. Khelifa, Y.O.M. Elhadj, Y. Abdellah and M. Belkasmi, "Enhancing Arabic Phoneme Recognizer using Duration Modeling Techniques," in Proc. of Fourth International Conference on Advances in Computing, Electronics and Communication - ACEC 2016, Dec. 15, 2016, Rome, Italy.
[12] Mohamed O.M. Khelifa, Y.O.M. Elhadj, Y. Abdellah and M. Belkasmi, "An Accurate HSMM-based System for Arabic Phonemes Recognition," in Proc. of the IEEE Ninth International Conference on Advanced Computational Intelligence (ICACI 2017), Feb. 2, 2017, Doha, Qatar.
[13] S. Young, "Large Vocabulary Continuous Speech Recognition: a Review," IEEE Signal Processing Magazine 13(5), pp. 45-57, 1996.
[14] Ali, A. et al., "A Complete KALDI Recipe for Building Arabic Speech Recognition Systems," Spoken Language Technology Workshop (SLT), IEEE, 2014.
[15] Khalid, A. et al., "Arabic Phonemes Transcription using Data Driven," The International Arab Journal of Information Technology, Vol. 12, No. 3, May 2015.
[16] Speaker-dependant continuous Arabic speech recognition. M.Sc. thesis, King Saud University, 2001.
[17] Hyassat H, Abu Zitar, "Arabic speech recognition using SPHINX engine," Int J Speech Tech 9(3–4):133–150, 2008.
[18] Azmi, M. et al., "Syllable-based automatic Arabic speech recognition in noisy-telephone channel," In: WSEAS transactions on signal processing proceedings, World Scientific and Engineering Academy and Society (WSEAS), vol. 4, issue 4, pp. 211–220, 2008.
[19] Y.O.M. Elhadj, M. et al., Design and Development of a High Quality Speech Corpus for Classical Arabic. Submitted for publication to the Language Resources and Evaluation Journal (LREV).
[20] Y.O.M. Elhadj, M. et al., Sound Corpus of a part of the noble Quran (in Arabic). Proc. of the International Conference on the Glorious Quran and Contemporary Technologies, King Fahd Complex for the Printing of the Holy Quran, Almadinah, Saudi Arabia, October 13-15, 2009.
[21] Y.O.M. Elhadj, Preparation of speech database with perfect reading of the last part of the Holy Quran (in Arabic). Proc. of the 3rd IEEE International Conference on Arabic Language Processing (CITAL'09), pp. 5-8, Rabat, Morocco, May 4-5, 2009.
[22] MATLAB and Statistics Toolbox Release 2013a, The MathWorks, Inc., Natick, Massachusetts, United States.

TABLE III.
THE BASIC ARABIC PHONEMES STATISTICS

| Basic Arabic phoneme | Label | Frequency of occurrence | Min duration (s) | Max duration (s) | Mean duration (s) | Mode (s) | Median (s) | Probability of occurrence |
|---|---|---|---|---|---|---|---|---|
| صامت | sil | 11875 | 0.022 | 8.576 | 0.315 | 0.230 | 0.282 | 0.077 |
| نون | ns10 | 8160 | 0.021 | 1.458 | 0.364 | 0.068 | 0.195 | 0.052 |
| عين | cs10 | 2700 | 0.033 | 0.420 | 0.124 | 0.099 | 0.155 | 0.017 |
| صاد | sb10 | 838 | 0.079 | 0.388 | 0.183 | 0.128 | 0.153 | 0.005 |
| سين | ss10 | 2175 | 0.071 | 0.384 | 0.170 | 0.136 | 0.149 | 0.014 |
| خاء | xs10 | 770 | 0.072 | 0.420 | 0.151 | 0.139 | 0.139 | 0.004 |
| دال | ds10 | 2190 | 0.039 | 0.433 | 0.162 | 0.083 | 0.136 | 0.014 |
| شين | js10 | 867 | 0.080 | 0.478 | 0.152 | 0.130 | 0.136 | 0.005 |
| فتحة | as10 | 40396 | 0.011 | 3.343 | 0.207 | 0.130 | 0.135 | 0.262 |
| كسرة | is10 | 12755 | 0.030 | 1.833 | 0.207 | 0.121 | 0.135 | 0.082 |
| ضمة | us10 | 9110 | 0.029 | 1.739 | 0.214 | 0.110 | 0.135 | 0.059 |
| قاف | qs10 | 1870 | 0.080 | 0.792 | 0.151 | 0.123 | 0.130 | 0.012 |
| ضاد | db10 | 443 | 0.021 | 0.629 | 0.155 | 0.124 | 0.128 | 0.002 |
| طاء | tb10 | 560 | 0.073 | 0.464 | 0.163 | 0.110 | 0.128 | 0.003 |
| غين | gs10 | 410 | 0.049 | 0.387 | 0.138 | 0.083 | 0.123 | 0.002 |
| لام | ls10 | 9066 | 0.015 | 0.767 | 0.146 | 0.069 | 0.123 | 0.058 |
| حاء | hb10 | 1457 | 0.050 | 0.335 | 0.127 | 0.114 | 0.122 | 0.009 |
| تاء | ts10 | 3483 | 0.019 | 0.959 | 0.141 | 0.114 | 0.121 | 0.022 |
| ياء | ys10 | 3677 | 0.019 | 1.392 | 0.150 | 0.100 | 0.120 | 0.023 |
| كاف | ks10 | 3040 | 0.028 | 0.480 | 0.136 | 0.105 | 0.119 | 0.019 |
| ثاء | vs10 | 600 | 0.032 | 0.311 | 0.117 | 0.117 | 0.112 | 0.003 |
| زاء | zs10 | 440 | 0.060 | 0.352 | 0.138 | 0.094 | 0.111 | 0.002 |
| جيم | jb10 | 1240 | 0.015 | 0.428 | 0.130 | 0.097 | 0.108 | 0.008 |
| فاء | fs10 | 3020 | 0.016 | 0.369 | 0.109 | 0.113 | 0.105 | 0.019 |
| هاء | hs10 | 4559 | 0.029 | 0.376 | 0.113 | 0.100 | 0.105 | 0.029 |
| باء | bs10 | 3739 | 0.012 | 0.654 | 0.144 | 0.085 | 0.104 | 0.024 |
| واو | ws10 | 4647 | 0.016 | 1.021 | 0.124 | 0.085 | 0.104 | 0.030 |
| ميم | ms10 | 6825 | 0.027 | 1.640 | 0.170 | 0.080 | 0.099 | 0.044 |
| ظاء | zb10 | 176 | 0.054 | 0.360 | 0.114 | 0.082 | 0.096 | 0.001 |
| ذال | vb10 | 2091 | 0.031 | 0.371 | 0.110 | 0.076 | 0.087 | 0.013 |
| همزة | hz10 | 6281 | 0.008 | 0.295 | 0.078 | 0.074 | 0.076 | 0.040 |
| راء | rs10 | 4620 | 0.014 | 0.403 | 0.096 | 0.066 | 0.075 | 0.029 |
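As an illustration of how the durations in Table III might be used, the following minimal Python sketch derives a per-phoneme HMM state count from the mean duration and groups phonemes whose median durations lie close together. It is not the authors' program. The 10 ms frame shift, the three-frames-per-state rule, the state limits, the 5 ms clustering tolerance, and the function names `estimate_num_states` and `cluster_by_median` are all assumptions made for this example. Only the duration values are taken from the table, and only for a few labels.

```python
# Mean and median durations in seconds for a few labels, copied from Table III.
MEAN_DURATION_S = {"sil": 0.315, "as10": 0.207, "ms10": 0.170,
                   "rs10": 0.096, "hz10": 0.078}
MEDIAN_DURATION_S = {"as10": 0.135, "is10": 0.135, "us10": 0.135,
                     "ds10": 0.136, "js10": 0.136, "hz10": 0.076,
                     "rs10": 0.075}

FRAME_SHIFT_S = 0.010   # assumption: 10 ms frame shift (not stated in the paper)
FRAMES_PER_STATE = 3    # assumption: hypothetical rule of thumb for this sketch


def estimate_num_states(mean_duration_s, frame_shift_s=FRAME_SHIFT_S,
                        frames_per_state=FRAMES_PER_STATE,
                        min_states=3, max_states=9):
    """Map a mean phoneme duration to an HMM state count (illustrative heuristic)."""
    frames = mean_duration_s / frame_shift_s       # average number of frames
    states = round(frames / frames_per_state)      # one state per few frames
    return max(min_states, min(max_states, states))  # clamp to assumed limits


def cluster_by_median(medians, tolerance_s=0.005):
    """Group labels whose median lies within tolerance_s of the cluster's first member."""
    clusters = []
    cluster_start = None
    for label, median in sorted(medians.items(), key=lambda kv: kv[1]):
        if cluster_start is None or median - cluster_start > tolerance_s:
            clusters.append([label])   # gap too large: open a new cluster
            cluster_start = median
        else:
            clusters[-1].append(label)
    return clusters


if __name__ == "__main__":
    for label, mean in MEAN_DURATION_S.items():
        print(label, estimate_num_states(mean))
    print(cluster_by_median(MEDIAN_DURATION_S))
```

Under these assumptions, long phonemes such as the silence model hit the upper state limit, while short ones such as hz10 and rs10 fall back to the minimum. The clustering step is a simple one-dimensional grouping and would need tuning before use in a decoder.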

REFERENCES

[1] D. Jurafsky and J. H. Martin, Speech and Language Processing, 2nd ed., Pearson Prentice Hall, 2009.
[2] G. Zweig and P. Nguyen, "A segmental CRF approach to large vocabulary continuous speech recognition," Proc. of IEEE ASRU, 2009.
[3] H. Sakoe, "Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition," Readings in Speech Recognition, Morgan Kaufmann Publishers Inc., pp. 180-186, 1990.
[4] H. Jiang, "Discriminative training for automatic speech recognition: A survey," Computer Speech & Language, vol. 24, no. 4, pp. 589-608, 2010.
[5] L. Deng and X. Li, "Machine learning paradigms for speech recognition: An overview," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 1060-1089, May 2013.
[6] I. Oparin, Language Models for Automatic Speech Recognition of Inflectional Languages, PhD thesis, University of West Bohemia, Plzen, Czech Republic, 2009.
[7] Y.O.M. Elhadj, I.A. Alsughayeir, M. Alghamdi, M. Alkanhal, Y.M. Ohali, and A.M. Alansari, Computerized teaching of the Holy Quran (in Arabic), Final Technical Report, King Abdulaziz City for Sciences and Technology (KACST), Riyadh, KSA, 2012.
[8] Y.O.M. Elhadj, M. Alghamdi, and M. Alkanhal, "Phoneme-Based Recognizer to Assist Reading the Holy Quran," Recent Advances in Intelligent Informatics, Advances in Intelligent Systems and Computing, Springer, pp. 141-152, 2014.
[9] Y.O.M. Elhadj, M. Alghamdi, and M. Alkanhal, "Approach for Recognizing Allophonic Sounds of the Classical Arabic Based on Quran Recitations," Theory and Practice of Natural Computing, Lecture Notes in Computer Science, Springer, pp. 57-67, 2013.
[10] Y.O.M. Elhadj, Mohamed O.M. Khelifa, A. Yousfi, and M. Belkasmi, "An Accurate Recognizer for Basic Arabic Sounds," ARPN Journal of Engineering and Applied Sciences, vol. 11, no. 5, pp. 3239-3243, Mar. 2016.
[11] Mohamed O.M. Khelifa, Y.O.M. Elhadj, Y. Abdellah, and M. Belkasmi, "Enhancing Arabic Phoneme Recognizer using Duration Modeling Techniques," in Proc. of Fourth International Conference on Advances in Computing, Electronics and Communication - ACEC 2016, Dec. 15, 2016, Rome, Italy.
[12] Mohamed O.M. Khelifa, Y.O.M. Elhadj, Y. Abdellah, and M. Belkasmi, "An Accurate HSMM-based System for Arabic Phonemes Recognition," in Proc. of the IEEE Ninth International Conference on Advanced Computational Intelligence (ICACI 2017), Feb. 2, 2017, Doha, Qatar.
[13] S. Young, "Large Vocabulary Continuous Speech Recognition: a Review," IEEE Signal Processing Magazine, vol. 13, no. 5, pp. 45-57, 1996.
[14] Ali, A. et al., "A Complete KALDI Recipe for Building Arabic Speech Recognition Systems," Spoken Language Technology Workshop (SLT), IEEE, 2014.
[15] Khalid, A. et al., "Arabic Phonemes Transcription using Data Driven," The International Arab Journal of Information Technology, vol. 12, no. 3, May 2015.
[16] Speaker-dependant continuous Arabic speech recognition, M.Sc. thesis, King Saud University, 2001.
[17] Hyassat H, Abu Zitar, "Arabic speech recognition using SPHINX engine," Int J Speech Tech, vol. 9, no. 3-4, pp. 133-150, 2008.
[18] Azmi, M. et al., "Syllable-based automatic Arabic speech recognition in noisy-telephone channel," in WSEAS Transactions on Signal Processing, World Scientific and Engineering Academy and Society (WSEAS), vol. 4, no. 4, pp. 211-220, 2008.
[19] Y.O.M. Elhadj, M. et al., Design and Development of a High Quality Speech Corpus for Classical Arabic. Submitted for publication to the Language Resources and Evaluation Journal (LREV).
[20] Y.O.M. Elhadj, M. et al., Sound Corpus of a part of the noble Quran (in Arabic). Proc. of the International Conference on the Glorious Quran and Contemporary Technologies, King Fahd Complex for the Printing of the Holy Quran, Almadinah, Saudi Arabia, October 13-15, 2009.
[21] Y.O.M. Elhadj, Preparation of speech database with perfect reading of the last part of the Holy Quran (in Arabic). Proc. of the 3rd IEEE International Conference on Arabic Language Processing (CITAL'09), pp. 5-8, Rabat, Morocco, May 4-5, 2009.
[22] MATLAB and Statistics Toolbox Release 2013a, The MathWorks, Inc., Natick, Massachusetts, United States.
