(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 8, No. 2, 2017
Helpful Statistics in Recognizing Basic Arabic
Phonemes
Mohamed O.M. Khelifa
TES Research Team, ENSIAS School of Engineering
Mohammed V University in Rabat, Rabat, Morocco

Yahya O.M. ElHadj
Doha Institute, Doha, Qatar
SAMoVA Research Team, IRIT, Paul Sabatier University, Toulouse, France

Yousfi Abdellah
FSJES-Souissi, Mohammed V University in Rabat, Rabat, Morocco

Mostafa Belkasmi
TES Research Team, ENSIAS School of Engineering
Mohammed V University in Rabat, Rabat, Morocco
Abstract: The recognition of continuous speech is one of the main challenges in building automatic speech recognition (ASR) systems, especially for phonetically complex languages such as Arabic. ASR currently seems to have reached an impasse: nearly all solutions follow the same general model, and previous research has focused on enhancing its performance by incorporating supplementary features. This paper is part of ongoing research efforts aimed at developing a high-performance Arabic speech recognition system for learning and teaching purposes. It presents a statistical analysis of certain distinctive features of the basic Arabic phonemes that appear helpful in enhancing the performance of a baseline HMM-based ASR system. The statistics are collected from a particular Arabic speech database, which involves ten different male speakers and more than eight hours of speech covering all Arabic phonemes. In the HMM modeling framework, the statistics provided are helpful in establishing the appropriate number of HMM states for each phoneme. They can also be used as an initial condition for the EM estimation procedure, which generally accelerates the estimation process and thus improves the performance of the system. The findings are presented, and possible applications to automatic speech recognition and speaker identification systems are also suggested.

Keywords: automatic speech recognition (ASR); speech recognizer; phoneme recognition; speech database; hidden Markov models (HMMs)

I. INTRODUCTION

The most common way for humans to communicate is through the sounds produced during speech. Thoughts and ideas are exchanged via speech: one person speaks and the other receives the message by ear. Automatic speech recognition (ASR) is the process by which a computer recognizes and acts upon spoken language or utterances using particular algorithms [1-5]. It is a branch of artificial intelligence (AI) and is related to various areas of knowledge, including informatics, linguistics, acoustics, and pattern recognition. An ordinary ASR system consists of a microphone unit, a speech recognition engine, a computer, and a certain form of audio, visual, or action output. The applications of an ASR system can be classified into two main areas: dictation and human-computer dialogue. In the dictation area, broadcast news dictation technology has been incorporated into information extraction and retrieval technology and into many application systems, such as retrieval systems and automatic voice document indexing. In the human-computer interaction area, a variety of experimental systems for information retrieval through spoken dialogue have been investigated. A common ASR application is the automated conversion of speech into written text, which can increase productivity and improve access to diverse computer applications such as word processing, email, remote control, telephony, language identification, speaker identification, archiving, and language acquisition.

By using speech as input, ASR applications reduce reliance on more traditional manual input techniques such as keyboards and mice, which makes them helpful as an alternative input technique for people with disabilities. ASR performance may be affected by various factors, including the quality of the input speech, the technology design, the surrounding environment, and speaker characteristics.

In spite of the remarkable advances in signal processing, computational architectures, algorithms, and hardware, ASR is still a topic of active research, and ideal systems are still far from being achieved [6]. The most important research issues must therefore be addressed in order to advance toward the ultimate goal of fluent speech recognition.

In speech recognition, recognizing isolated words is relatively simple; the main challenge is recognizing continuous speech. Any ASR system has two parts: the language model and the acoustic model. The language model indicates how likely the word sequences to be recognized are: are they common or rare? The acoustic model, in turn, models the sounds we produce when we speak. For a small vocabulary, it is easy to model the acoustics of individual words. As vocabulary size grows, it becomes impractical to
www.ijacsa.thesai.org
record sufficient spoken examples of all words, so acoustics must be modeled at a lower level. State-of-the-art ASR systems do not rely on whole words in either training or decoding, because of the enormous number of words that may exist in a speech corpus and the need for sufficient spoken examples of each word. Instead, a successful ASR system uses smaller sub-word units that are commonly designed by phoneticians or experts in linguistics. This set of sub-word units is referred to as phonemes.

Most current successful ASR systems are based on hidden Markov models (HMMs), in which each phoneme is modeled by a set of HMM states. A left-to-right HMM topology with three emitting states is commonly used for each phoneme, independent of its length. The question that arises is whether this number of states is sufficient for a given phoneme, or whether it is greater or fewer than what is needed. One of the main issues in an ASR system is therefore to determine the number of HMM states that reflects the actual length of each phoneme occurrence in a speech corpus.

Despite the sizable use of speech recognition technologies for languages such as English and French, Arabic suffers from a rarity of mature ASR-based applications, especially for language teaching and learning. One notable application of Arabic speech recognition is the teaching of the Classical Arabic (CA) sound system. Although Classical Arabic is not used in everyday communication, it is required for learning the Holy Quran (the Muslim Holy Book) and the old Arabic poetry heritage. Moreover, it can open the door to various kinds of Islamic applications.

The present paper is part of ongoing research efforts aiming to develop a high-performance Arabic speech recognition system for learning and teaching purposes. The first stages of these efforts were dedicated to the development of a particular Arabic speech database, including ten different speakers and more than eight hours of speech collected from recitations of the Holy Quran, in which all Arabic phonemes are included. The speech signals of this database were manually and accurately segmented and labeled on three levels: word, phoneme, and allophone. Next, two baseline HMM-based recognizers were built to validate the speech segmentation at both the phoneme and allophone levels and to examine the recognition accuracy of both recognizers.

The current stage presents a statistical analysis of certain distinctive features of Arabic phonemes, with the aim of incorporating them later into the speech recognition process to improve the performance of our baseline HMM-based recognizers. The distinctive features investigated in this work are the phoneme durations (minimum, maximum, mean, mode, and median for each basic phoneme) and the frequency and probability of occurrence of each basic phoneme. Analysis and interpretation were performed to determine which of these distinctive features can significantly enhance system performance. In the HMM modeling framework, the statistics provided can be helpful in establishing the appropriate number of HMM states for each phoneme, which generally increases speed and recognition accuracy. The phoneme statistics can also be used as an initial condition for the Expectation-Maximization estimation procedure, and hence accelerate the estimation process, or they can serve as the desired model itself. In addition, the probability of two neighboring phoneme clusters is helpful information that is not yet integrated into the adjustment of speech characteristics of possible words from a dictionary.

The rest of the article is organized as follows. Section II summarizes our research efforts toward the ultimate goal. Section III describes the motivation for the presented work. Section IV gives a brief overview of the previously developed speech database. Section V presents the methodology used for statistics extraction. Section VI gives the details of the statistical analysis. Finally, Section VII concludes the paper.

II. RESEARCH EFFORTS SUMMARY

Building on a previously funded research project [7], two baseline HMM-based systems, one for phonemes and one for allophones [8, 9], were constructed using the speech database mentioned above. The speech database contains 110 allophones plus a silence unit, which is counted as a normal allophone and indicates short pauses during the recitations, while the number of phonemes is 60, roughly half the number of allophones. At both levels, every speech unit was modeled by an HMM with three emitting states to capture its acoustic properties. Each state was associated with a Gaussian mixture model (GMM) describing the characteristics of the portion of sound at that state. Mel-frequency cepstral coefficients (MFCCs) were used as acoustic features. For each Hamming window of 10 ms, a vector of 39 coefficients was extracted. These comprise the first twelve MFCCs and the energy, which capture the static features of the sound in that portion, together with their first and second derivatives, which capture its dynamic features. The hidden Markov model toolkit (HTK) was employed to train and test the HMMs for both systems. The word error rates (WERs) obtained for these recognizers were 8% for phonemes and 12% for allophones.

Our current efforts focus on the development of a more elaborate system, by first considering the basic sounds and then looking for the distinctive features that are particularly helpful in identifying their phonological variations. To this end, we adapted the speech database so that it is annotated in terms of basic phonemes. By basic phonemes we mean the basic sounds without any phonological variation, and without considering gemination (the doubling of sounds). There are 32 of them. Their list and associated codes are shown in Table II.

The new version of the speech database has been used in all work accomplished so far, including an HMM-based recognizer for basic Arabic sounds [10], an enhanced Arabic phoneme recognizer using duration modeling techniques [11], and an accurate HSMM-based system for Arabic phoneme recognition [12]. In the latest system implemented for the basic Arabic phonemes [12], the average recognition rates obtained are about 99%.
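As a rough illustration of how a 39-dimensional vector of the kind described in Section II can be assembled, the sketch below stacks 13 static values per frame (twelve MFCCs plus energy) with their first and second time derivatives. It is not the HTK configuration used for the recognizers: the regression formula, the half-width of 2 frames, the edge handling, and the names `deltas` and `stack_features` are all assumptions made for the example, and the input values are made up.

```python
from typing import List

Frame = List[float]


def deltas(frames: List[Frame], half_width: int = 2) -> List[Frame]:
    """Regression-based time derivative of a feature sequence.

    Frames requested beyond either end are replaced by the first or
    last frame (an assumption; other edge policies are possible).
    """
    n = len(frames)
    norm = 2.0 * sum(k * k for k in range(1, half_width + 1))
    out: List[Frame] = []
    for t in range(n):
        row: Frame = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, half_width + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out


def stack_features(static: List[Frame]) -> List[Frame]:
    """Append first and second derivatives to every static frame."""
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]


# Toy input: 5 frames of 13 static values (12 MFCCs + energy), made up.
static = [[float(t + i) for i in range(13)] for t in range(5)]
features = stack_features(static)
print(len(features), "frames of", len(features[0]), "coefficients")
```

With 13 static values per frame, each stacked vector has 13 x 3 = 39 entries, which matches the dimensionality stated in the text.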
III. BACKGROUND AND MOTIVATION

Automatic speech recognition (ASR) currently seems to have reached an impasse. Nearly all solutions follow the same general model [13], and research has focused on enhancing its performance by integrating supplementary elements. Such an approach has yielded better results, but it must be admitted that there is a limit that cannot be exceeded without modifying the general scheme. The method based on hidden Markov models (HMMs) with features computed over frames of fixed length has proved useful in numerous applications. However, it does not seem effective enough to properly transcribe any spoken language with a large vocabulary. There are several reasons, some of which are very straightforward. A dictionary-based ASR system will never work correctly for out-of-dictionary words. Grammar models will not deal correctly with incorrectly spoken utterances, while humans very often can.

An ASR system tries to recognize speech via these matching techniques, while humans can easily understand it and adapt to mistakes and unusual words. This causes the limit of the classical ASR approaches mentioned above. The standard ASR approach relies, in effect, on guesswork in several steps of its procedure. The input speech is segmented into frames without any motivated rules. The HMM attempts to find the closest transcription on the basis of speech features, which is itself a kind of guessing. Such an approach works well enough for plainly spoken words with a limited vocabulary. Noise, speaking rate, and a large vocabulary cause many omissions and missing data that the HMM cannot handle correctly. Another major problem is that people do not speak as carefully as they write, while we expect a transcription produced by an ASR system to be of the same quality as our typed texts.

It must also be admitted, by both ordinary users and researchers, that when we speak we do not always follow grammar rules and, furthermore, that mistakes in pronunciation introduce various exceptions regardless of the dictionary size used. This is why adopting a hypothesis based on language rules and a limited dictionary does not always work satisfactorily. The same issues arise with names, out-of-language words, mispronounced phonemes, and so on. An ASR system attempts to fit the input speech to the language rules and the static vocabulary, which, in certain cases, leads to additional distortions and hence to degradation in system performance.

There is no straightforward solution to the problems described above. In this work, we suggest using phoneme statistics collected for a target language as, for instance, a support for the dictionary when there is difficulty in associating the matching features with one of the words to be recognized in the vocabulary.

The most notable research on continuous speech is based on statistical approaches, specifically hidden Markov models (HMMs). Many HMM-based ASR systems for continuous Arabic speech have reached various levels of recognition accuracy, with encouraging performance [14-18]. Recognition accuracy is usually measured as the percentage of correctly recognized phonemes. The performance of HMM-based ASR systems is affected by various factors, including the presence of noise, the number of HMM states associated with each phoneme, the phoneme combination used, and the phoneme length. Enhancing the performance of present ASR techniques requires examining these factors in order to locate and identify the areas for improvement.

Nonetheless, no full statistical analysis at the phoneme level has been carried out on the speech database of Classical Arabic sounds used in this work. Statistical analysis of Arabic phonemes gives a clear view of phoneme behavior and makes it possible to account for this behavior by examining the gathered statistics. For example, the frequency of a specific phoneme in a speech database can be used to correct its misrecognition during the decoding process, that is, to replace the misrecognized phoneme with the most probable one.

Furthermore, the average duration of a particular phoneme can be used to estimate the number of HMM states most appropriate for recognizing it. Additional statistical information, such as the mode (the most frequent value in a set of values) and the median (the middle value in a set of values), is useful in addressing misrecognized phonemes during the decoding process. In this paper, we present a full statistical analysis of Arabic phonemes that can be used to enhance the performance of our baseline HMM-based systems by reducing the word error rate (WER).

IV. SPEECH DATABASE OF SOUNDS

The Arabic language is the official language of about 300 million speakers around the world. It is the religious language of all Muslims around the world, regardless of their native language. It is the official language in all Arab countries and the 6th most widely used language in terms of first-language speakers. Arabic can be categorized into two main variants: Classical Arabic (CA) and Modern Standard Arabic (MSA). CA is an old literary form of Arabic; it is the most formal type and is the language of the Holy Quran and of old Arabic poetry. MSA is the current standard form of Arabic, which is used in official communications in Arab countries, broadcast news, formal speeches, and so on. Because Arabic is one of the most stable languages throughout history, there is no big difference between today's Arabic (MSA) and that spoken by the early Arabs (CA), yet there are some idiosyncrasies in pronunciation.

One of the main barriers to the development of ASR applications for Arabic speech is the rarity of suitable sound databases, which are required for training and testing statistical models. This problem is particularly acute for Classical Arabic, since most of the corpora available today are specifically oriented toward Modern Standard Arabic (MSA) and its sub-forms (i.e., dialects). To remedy this problem and to assist the development of ASR applications for Classical Arabic, a speech database covering all Classical Arabic sounds was designed on the basis of Quranic recitations. The speech corpus was developed in a previously funded project by Al-Imam Muhammad ibn Saud Islamic University in Saudi Arabia with the support of King Abdulaziz City for Science
and Technology (KACST). Because of the difficulty of developing this kind of corpus, only a part of the Holy Quran was covered. Recitations by ten male speakers were recorded in an appropriate environment under the supervision of an expert in the rules of Quranic pronunciation (called Tajweed); more than eight hours of speech were obtained [19-21]. Each audio file contains a Quranic verse, or a portion of one for long verses where the speaker must pause to take a breath.

In order to make the speech database useful for many purposes, the speech signals were manually and accurately segmented on three levels: word, phoneme, and allophone. A new labeling system was proposed to annotate the speech segments [16], because the available labeling systems (e.g., IPA, SAMPA, BEEP) were not able to cover all Arabic sounds. The speech database consists of 44.1 kHz WAV files of 16 millisecond utterances, together with their corresponding MFCC feature files, label files, and TextGrid files.

Table I lists, for each speaker, the number of sound files, their size, and their duration. The list of basic Arabic phonemes and their associated codes is shown in Table II.

TABLE I. SOUND FILES AND THEIR DURATION BY SPEAKERS

Speaker Number  Speaker Initials  Number of Sound Files  Duration (minutes)  Size (MB)
1               AAH               600                    49.36               249
2               AAS               590                    52.09               261
3               AMS               612                    45.78               229
4               ANS               597                    49.72               250
5               BAN               585                    54.75               276
6               FFA               578                    44.11               220
7               HSS               601                    49.76               251
8               MAS               580                    46.24               232
9               MAZ               608                    51.47               258
10              SKG               584                    44.29               220
Total                             5935                   487.53 (8h, 8m)     2446

TABLE II. LIST OF BASIC ARABIC PHONEMES AND THEIR CODES

Label  Arabic Orthography    Label  Arabic Orthography
as10   ـَـ فتحة                sb10   ص صاد
us10   ـُـ ضمة                 db10   ض ضاد
is10   ـِـ كسرة                tb10   ط طاء
hz10   ء همزة                zb10   ظ ظاء
bs10   ب باء                 cs10   ع عين
ts10   ت تاء                 gs10   غ غين
vs10   ث ثاء                 fs10   ف فاء
jb10   ج جيم                 qs10   ق قاف
hb10   ح حاء                 ks10   ك كاف
xs10   خ خاء                 ls10   ل لام
ds10   د دال                 ms10   م ميم
vb10   ذ ذال                 ns10   ن نون
rs10   ر راء                 hs10   هـ هاء
zs10   ز زاء                 ws10   و واو
ss10   س سين                 ys10   ي ياء
js10   ش شين                 sil    صامت

In addition, the speech database contains a list of 60 Arabic phonemes, an Arabic dictionary, a list of all distinct words included in the whole eight-hour speech database, and other useful files needed for recognizer development.

V. STATISTICS EXTRACTION METHOD

To extract statistics from the speech database, a computer program was written in the MATLAB programming language developed by MathWorks [22]. For each basic phoneme, the program calculates the probability of occurrence, the frequency of occurrence, the mean duration, the minimum and maximum durations, and the mode and median of the duration. Durations are computed from the phoneme boundaries extracted from the TextGrid files supplied with the speech database.

These statistics are displayed in Table III, which also shows the labels used for every basic phoneme in the speech database. Fig. 1 shows the mean duration of the basic phonemes, measured in seconds. The frequency of each basic phoneme in the whole database is shown in Fig. 2. For an in-depth analysis of the collected statistics, and to provide extra information about the characteristics of the basic Arabic phonemes, further graphs are shown in Figures 3, 4, 5 and 6.

Fig. 1. Mean Duration of the Basic Arabic Phonemes (plot title: Mean Phoneme Duration; x-axis: basic Arabic phonemes; y-axis: mean duration in seconds, 0.0 to 0.4)

Fig. 2. Basic Arabic Phonemes Frequencies (plot title: Frequency; x-axis: basic Arabic phoneme labels; y-axis: frequency, 0 to 50000)
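The extraction described in Section V was implemented by the authors in MATLAB, and that program is not reproduced in the paper. The following Python sketch only illustrates the same kind of computation under simple assumptions: the phoneme boundaries are taken to be already available as (label, start, end) tuples in seconds (TextGrid parsing is omitted), durations are rounded to the millisecond before the mode is taken, and the function name `phoneme_statistics` and the sample intervals are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, median
from typing import Dict, Iterable, List, Tuple

# (label, start time in seconds, end time in seconds)
Interval = Tuple[str, float, float]


def phoneme_statistics(intervals: Iterable[Interval]) -> Dict[str, dict]:
    """Per-phoneme frequency, probability and duration statistics."""
    durations: Dict[str, List[float]] = defaultdict(list)
    for label, start, end in intervals:
        # Round to 1 ms so that repeated durations can form a mode.
        durations[label].append(round(end - start, 3))

    total = sum(len(values) for values in durations.values())
    stats: Dict[str, dict] = {}
    for label, values in durations.items():
        # Most frequent duration; ties resolve to the first one seen.
        mode_value = Counter(values).most_common(1)[0][0]
        stats[label] = {
            "frequency": len(values),
            "probability": len(values) / total,
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "mode": mode_value,
            "median": median(values),
        }
    return stats


# Made-up intervals, only to exercise the function.
intervals = [
    ("as10", 0.00, 0.10),
    ("bs10", 0.10, 0.18),
    ("as10", 0.18, 0.28),
    ("ls10", 0.28, 0.40),
    ("as10", 0.40, 0.60),
]
stats = phoneme_statistics(intervals)
for label, row in sorted(stats.items()):
    print(label, row)
```

Rounding before taking the mode is a design choice for this sketch: on unrounded continuous durations almost every value is unique, so the mode would carry little information.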
Fig. 3. Basic Arabic Phonemes Occurrence Probability (plot title: Arabic Phonemes Occurrence Probability; x-axis: basic Arabic phonemes; y-axis: probability, 0.0 to 0.3)

Fig. 4. Sorted Basic Arabic Phonemes based on their Means (plot title: Classed Arabic Phonemes Based On Mean Duration; x-axis: basic Arabic phoneme labels; y-axis: mean duration in seconds, 0.0 to 0.4)

Fig. 5. Sorted Basic Arabic Phonemes based on their Medians (plot title: Classed Arabic Phonemes Based On Duration Medians; x-axis: basic Arabic phonemes; y-axis: median duration in seconds, 0.0 to 0.3)

Fig. 3 shows the occurrence probability of the basic Arabic phonemes in the whole speech database. This graph serves to define the probability of missing phonemes during the decoding process. Note that the phoneme “sil”, which denotes silence regardless of where it occurs in the speech database, is included in all the graphs.

An interesting outcome apparent from Fig. 4 is that basic phonemes having equal or similar mean values can be grouped into clusters. We assume that these clusters will be helpful for enhancing the performance of the baseline recognizer, as discussed in the following sections. The medians of the basic phoneme durations give a clearer view of those clusters: the phoneme groups are differentiated from each other, and the separation among them becomes more obvious, as seen in Fig. 5.

Fig. 6. Sorted Basic Arabic Phonemes based on their Modes (plot title: Classed Arabic Phonemes Based On Duration Mode; x-axis: basic Arabic phonemes; y-axis: duration mode in seconds, 0.00 to 0.25)

Another significant graph shows the most frequent duration value among all occurrences of each basic phoneme in the "CA Sound Database". This value is referred to as the mode and is displayed in Fig. 6.

VI. STATISTICS ANALYSIS

The previous tables and graphs show that the basic phonemes occur with different frequencies. The most frequent ones are “as10” (فتحة), “is10” (كسرة) and “us10” (ضمة), in that order, which denote the Arabic vowels. The least frequent ones are “zb10” (حرف الظاء), “gs10” (حرف الغين) and “zs10” (حرف الزاء), in that order. This ranking ignores the phoneme denoting silence, “sil” (صامت). From the results shown in Figures 2 and 3, it seems clear that when a phoneme is missed during the decoding process, the phoneme "as10" is the most probable candidate to replace it. More generally, the results in Fig. 3 can be used to correct a misrecognized phoneme in spoken utterances during the recognition phase. Using this information seems helpful in enhancing the performance of the baseline system.

Fig. 4 shows all the basic Arabic phonemes sorted by their average durations. This figure clearly shows the behavior of the basic phoneme durations across the whole speech database. It gives an explicit idea of the average duration of each phoneme, from which clusters of basic phonemes can be distinguished. For example, the basic phonemes “hz10” and “rs10” form the first cluster. The second cluster includes “vb10”, “fs10” and “hs10”. The vowels form the last cluster, with among the highest average durations. In general, knowing the average length of a specific phoneme in a speech database can be used to estimate the appropriate number of HMM states to represent it, which generally shortens the estimation time and hence enhances recognition accuracy.
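The text does not specify how an average duration is mapped to a number of HMM states, so the sketch below shows one hypothetical heuristic rather than the authors' procedure. It assumes a 10 ms frame shift, a target of about three frames per state, and bounds of three to seven states (all assumptions). It also uses the standard property that a state with self-loop probability a is occupied for 1/(1 - a) frames on average, which gives an initial self-loop value that could seed EM training. The mean durations are taken from Table III.

```python
from typing import Tuple


def initial_topology(mean_duration_s: float,
                     frame_shift_s: float = 0.010,
                     frames_per_state: float = 3.0,
                     min_states: int = 3,
                     max_states: int = 7) -> Tuple[int, float]:
    """Suggest a state count and an initial self-loop probability.

    All default parameter values are assumptions for illustration.
    """
    # Average number of frames spanned by one occurrence of the phoneme.
    mean_frames = mean_duration_s / frame_shift_s

    # Hypothetical rule: roughly `frames_per_state` frames per state,
    # clamped to a plausible range for a left-to-right model.
    n_states = round(mean_frames / frames_per_state)
    n_states = max(min_states, min(max_states, n_states))

    # With geometric state durations, the expected stay in a state is
    # 1 / (1 - a), so a = 1 - 1 / (expected frames in that state).
    frames_in_state = max(mean_frames / n_states, 1.0)
    self_loop = 1.0 - 1.0 / frames_in_state
    return n_states, self_loop


# Mean durations in seconds copied from Table III.
for label, mean_s in [("hz10", 0.078), ("as10", 0.207), ("ns10", 0.364)]:
    n, a = initial_topology(mean_s)
    print(f"{label}: {n} emitting states, initial self-loop {a:.3f}")
```

Under these assumptions a short phoneme such as “hz10” keeps the usual three emitting states, while longer phonemes receive more states; the bounds and the frames-per-state target would need to be tuned on held-out data.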
In Fig. 5 and Fig. 6, the median and mode durations of each basic phoneme are displayed, and the clusters of basic phonemes appear clearly. The outcomes of both figures could help in making the correct decision when dealing with misrecognized or missed phonemes, namely by replacing them with the phoneme whose median or mode is nearest.

VII. CONCLUSION

In this paper, we have presented a collection of statistical data for the basic Arabic phonemes that is helpful in enhancing the performance of HMM-based automatic speech recognition systems. In the literature, the duration of phonemes is regarded as a major distinctive feature characterizing the voice of a speaker. Knowing the duration of a particular phoneme in spoken utterances can be used to estimate the length of the HMM chain describing it, which in turn improves system performance. These investigations were performed using a particular speech database of Quranic sounds including more than eight hours of speech and ten different male speakers. The numerical values were extracted using a computer program designed for this purpose. A discussion of these results, with interpretations, was also presented and reported graphically. Dividing phonemes into clusters on the basis of the median of their durations can help narrow the search for the appropriate phoneme during the decoding process, which in turn increases system performance. The collected statistics can also be used to build or propose other techniques for phoneme classification. Since the probability distributions in HMM-based ASR systems are usually estimated with the iterative Expectation-Maximization algorithm, the statistics provided can be used as an initial condition for the estimation procedure, and thus speed up its execution, or can serve as the desired model itself.

We believe that, given the absence of such numerical data describing the behavior of the basic Arabic phonemes in Classical Arabic, the statistics reported here add value to the presented work. Our future steps will focus on incorporating these statistics explicitly into HMMs in order to overcome the weaknesses of classical HMMs and, hence, improve the performance of HMM-based systems.

ACKNOWLEDGMENT

The presented work utilizes the results (Classical Arabic Sound Database) of a project previously funded by King Abdulaziz City for Science and Technology (KACST) in Saudi Arabia under grant number “AT – 25 – 113”.

REFERENCES

[1] D. Jurafsky and J. H. Martin, Speech and Language Processing, 2nd ed., Pearson Prentice Hall, 2009.
[2] G. Zweig and P. Nguyen, “A segmental CRF approach to large vocabulary continuous speech recognition,” Proc. of IEEE ASRU, 2009.
[3] H. Sakoe, “Two-level DP-matching - a dynamic programming-based pattern matching algorithm for connected word recognition,” Readings in Speech Recognition, Morgan Kaufmann Publishers Inc, pp. 180-186, 1990.
[4] H. Jiang, “Discriminative training for automatic speech recognition: A survey,” Computer Speech & Language, vol. 24, no. 4, pp. 589–608, 2010.
[5] L. Deng and X. Li, “Machine learning paradigms for speech recognition: An overview,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 5, pp. 1060–1089, May 2013.
[6] I. Oparin, Language Models for Automatic Speech Recognition of Inflectional Languages. PhD Thesis, University of West Bohemia, Plzen, Czech Republic, 2009.
[7] Y.O.M. Elhadj, I.A. Alsughayeir, M. Alghamdi, M. Alkanhal, Y.M. Ohali, A.M. Alansari, Computerized teaching of the Holy Quran (in Arabic), Final Technical Report, King Abdulaziz City for Science and Technology (KACST), Riyadh, KSA, 2012.
[8] Y.O.M. Elhadj, M. Alghamdi, and M. Alkanhal, “Phoneme-Based Recognizer to Assist Reading the Holy Quran,” Recent Advances in Intelligent Informatics, Advances in Intelligent Systems and Computing, Springer, pp. 141-152, 2014.
[9] Y.O.M. Elhadj, M. Alghamdi, and M. Alkanhal, “Approach for Recognizing Allophonic Sounds of the Classical Arabic Based on Quran Recitations,” Theory and Practice of Natural Computing, Lecture Notes in Computer Science, Springer, pp. 57-67, 2013.
[10] Y.O.M. Elhadj, Mohamed O.M. Khelifa, A. Yousfi and M. Belkasmi, “An Accurate Recognizer for Basic Arabic Sounds,” ARPN Journal of Engineering and Applied Sciences, vol. 11, no. 5, pp. 3239-3243, Mar. 2016.
[11] Mohamed O.M. Khelifa, Y.O.M. Elhadj, Y. Abdellah and M. Belkasmi, “Enhancing Arabic Phoneme Recognizer using Duration Modeling Techniques,” in Proc. of Fourth International Conference on Advances in Computing, Electronics and Communication - ACEC 2016, Dec. 15, 2016, Rome, Italy.
[12] Mohamed O.M. Khelifa, Y.O.M. Elhadj, Y. Abdellah and M. Belkasmi, “An Accurate HSMM-based System for Arabic Phonemes Recognition,” in Proc. of the IEEE Ninth International Conference on Advanced Computational Intelligence (ICACI 2017), Feb. 2, 2017, Doha, Qatar.
[13] S. Young, “Large Vocabulary Continuous Speech Recognition: a Review,” IEEE Signal Processing Magazine 13(5), pp. 45-57, 1996.
[14] Ali, A. et al., “A Complete KALDI Recipe for Building Arabic Speech Recognition Systems,” Spoken Language Technology Workshop (SLT), IEEE, 2014.
[15] Khalid, A. et al., “Arabic Phonemes Transcription using Data Driven,” The International Arab Journal of Information Technology, Vol. 12, No. 3, May 2015.
[16] Speaker-dependant continuous Arabic speech recognition. M.Sc. thesis, King Saud University, 2001.
[17] Hyassat H, Abu Zitar, “Arabic speech recognition using SPHINX engine,” Int J Speech Tech 9(3–4):133–150, 2008.
[18] Azmi, M. et al., “Syllable-based automatic Arabic speech recognition in noisy-telephone channel,” In: WSEAS transactions on signal processing proceedings, World Scientific and Engineering Academy and Society (WSEAS), vol. 4, issue 4, pp. 211–220, 2008.
[19] Y.O.M. Elhadj, M. et al., Design and Development of a High Quality Speech Corpus for Classical Arabic. Submitted for publication to the Language Resources and Evaluation Journal (LREV).
[20] Y.O.M. Elhadj, M. et al., Sound Corpus of a part of the noble Quran (in Arabic). Proc. of the International Conference on the Glorious Quran and Contemporary Technologies, King Fahd Complex for the Printing of the Holy Quran, Almadinah, Saudi Arabia, October 13-15, 2009.
[21] Y.O.M. Elhadj. Preparation of speech database with perfect reading of the last part of the Holy Quran (in Arabic). Proc. of the 3rd IEEE International Conference on Arabic Language Processing (CITAL'09), pp. 5-8, Rabat, Morocco, May 4-5, 2009.
[22] MATLAB and Statistics Toolbox Release 2013a, The MathWorks, Inc., Natick, Massachusetts, United States.
TABLE III. THE BASIC ARABIC PHONEMES STATISTICS

Label  Frequency of occurrence  Min duration (s)  Max duration (s)  Mean duration (s)  Mode (s)  Median (s)  Probability of occurrence  Basic Arabic phoneme
sil    11875                    0.022             8.576             0.315              0.230     0.282       0.077                      صامت
ns10   8160                     0.021             1.458             0.364              0.068     0.195       0.052                      نون
cs10   2700                     0.033             0.420             0.124              0.099     0.155       0.017                      عين
sb10   838                      0.079             0.388             0.183              0.128     0.153       0.005                      صاد
ss10   2175                     0.071             0.384             0.170              0.136     0.149       0.014                      سين
xs10   770                      0.072             0.420             0.151              0.139     0.139       0.004                      خاء
ds10   2190                     0.039             0.433             0.162              0.083     0.136       0.014                      دال
js10   867                      0.080             0.478             0.152              0.130     0.136       0.005                      شين
as10   40396                    0.011             3.343             0.207              0.130     0.135       0.262                      فتحة
is10   12755                    0.030             1.833             0.207              0.121     0.135       0.082                      كسرة
us10   9110                     0.029             1.739             0.214              0.110     0.135       0.059                      ضمة
qs10   1870                     0.080             0.792             0.151              0.123     0.130       0.012                      قاف
db10   443                      0.021             0.629             0.155              0.124     0.128       0.002                      ضاد
tb10   560                      0.073             0.464             0.163              0.110     0.128       0.003                      طاء
gs10   410                      0.049             0.387             0.138              0.083     0.123       0.002                      غين
ls10   9066                     0.015             0.767             0.146              0.069     0.123       0.058                      لام
hb10   1457                     0.050             0.335             0.127              0.114     0.122       0.009                      حاء
ts10   3483                     0.019             0.959             0.141              0.114     0.121       0.022                      تاء
ys10   3677                     0.019             1.392             0.150              0.100     0.120       0.023                      ياء
ks10   3040                     0.028             0.480             0.136              0.105     0.119       0.019                      كاف
vs10   600                      0.032             0.311             0.117              0.117     0.112       0.003                      ثاء
zs10   440                      0.060             0.352             0.138              0.094     0.111       0.002                      زاء
jb10   1240                     0.015             0.428             0.130              0.097     0.108       0.008                      جيم
fs10   3020                     0.016             0.369             0.109              0.113     0.105       0.019                      فاء
hs10   4559                     0.029             0.376             0.113              0.100     0.105       0.029                      هـاء
bs10   3739                     0.012             0.654             0.144              0.085     0.104       0.024                      باء
ws10   4647                     0.016             1.021             0.124              0.085     0.104       0.030                      واو
ms10   6825                     0.027             1.640             0.170              0.080     0.099       0.044                      ميم
zb10   176                      0.054             0.360             0.114              0.082     0.096       0.001                      ظاء
vb10   2091                     0.031             0.371             0.110              0.076     0.087       0.013                      ذال
hz10   6281                     0.008             0.295             0.078              0.074     0.076       0.040                      همزة
rs10   4620                     0.014             0.403             0.096              0.066     0.075       0.029                      راء
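To make the intended use of these figures concrete, the sketch below looks up a replacement candidate for a misrecognized or missed segment from a few rows of Table III. The selection rule (closest median duration, ties broken by the higher probability of occurrence) and the function names are illustrative assumptions, not a procedure given in the paper; only the numbers are copied from the table.

```python
from typing import Dict, Tuple

# label -> (median duration in seconds, probability of occurrence),
# a subset of rows copied from Table III.
TABLE_III: Dict[str, Tuple[float, float]] = {
    "as10": (0.135, 0.262),
    "is10": (0.135, 0.082),
    "ls10": (0.123, 0.058),
    "ms10": (0.099, 0.044),
    "hz10": (0.076, 0.040),
    "rs10": (0.075, 0.029),
}


def most_probable(table: Dict[str, Tuple[float, float]]) -> str:
    """Phoneme with the highest probability of occurrence."""
    return max(table, key=lambda label: table[label][1])


def nearest_by_median(duration_s: float,
                      table: Dict[str, Tuple[float, float]]) -> str:
    """Phoneme whose median duration is closest to the observed one.

    Ties are broken in favor of the more probable phoneme
    (an assumption made for this sketch).
    """
    return min(
        table,
        key=lambda label: (abs(duration_s - table[label][0]),
                           -table[label][1]),
    )


print(most_probable(TABLE_III))
print(nearest_by_median(0.077, TABLE_III))
print(nearest_by_median(0.135, TABLE_III))
```

The same lookup could be run against the mode column instead of the median; which statistic works better as a tie-breaking cue would have to be checked experimentally.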