Papers by Masood Ghayoomi

Alzahra University, 2024
Annually, researchers in various scientific fields publish their research results as technical reports or articles in proceedings or journals. This collection of data is used by search engines and digital libraries to search and access research publications, which usually retrieve related articles based on the query keywords rather than the articles' subjects. Consequently, accurate classification of scientific articles can increase the quality of users' searches when seeking a scientific document in databases. The primary purpose of this paper is to provide a classification model to determine the scope of scientific articles. To this end, we propose a model which uses the enriched contextualized knowledge of Persian articles through distributional semantics. Accordingly, identifying the specific field of each document and defining its domain with prominent enriched knowledge enhances the accuracy of scientific article classification. To reach this goal, we enriched the contextualized embedding models, either ParsBERT or XLM-RoBERTa, with latent topics to train a multilayer perceptron model. According to the experimental results, the overall performance of ParsBERT-NMF-1HT was 72.37% (macro) and 75.21% (micro) in terms of F-measure, a statistically significant improvement over the baseline (p<0.05).
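A minimal sketch of this kind of pipeline, with toy documents and illustrative sizes; the HuggingFace model id, topic count, and classifier settings are assumptions, not the paper's configuration:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Toy labelled documents; real experiments use full Persian articles.
docs = ["مقاله اول درباره زبان‌شناسی رایانشی", "مقاله دوم درباره پزشکی بالینی"]
labels = [0, 1]

# Contextual document representation: the [CLS] vector from ParsBERT
# (model id is an assumption; XLM-RoBERTa can be swapped in the same way).
tok = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
bert = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
with torch.no_grad():
    enc = tok(docs, padding=True, return_tensors="pt")
    cls = bert(**enc).last_hidden_state[:, 0, :].numpy()

# Latent topics from NMF over tf-idf, used to enrich the representation.
topics = NMF(n_components=2, init="nndsvda").fit_transform(
    TfidfVectorizer().fit_transform(docs))

X = np.hstack([cls, topics])                      # enriched features
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, labels)
print(clf.predict(X))
```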
Applications of Chatbots in Education
Advances in web technologies and engineering book series, Feb 24, 2023
In this paper we propose a model of the construction of Persian noun and adjectival phrases in a phrase structure grammar. This model uses the Interaction Grammar (IG) formalism, taking advantage of polarities on features and tree descriptions for the various constructions that we studied. The proposed grammar was implemented with a metagrammar compiler named XMG. A small test suite was built and tested with a parser based on IG, called LEOPAR. The experimental results show that we could parse the phrases successfully, even the most complex ones which contain various constructions.
Syntactic Parsing of Persian: From Theory to Practice (Chapter 5)
De Gruyter eBooks, May 8, 2023

Language independent optimization of text readability formulas with deep reinforcement learning
Information Design Journal, 2023
Readability formulas are used to assess the level of difficulty of a text. These language-dependent formulas are introduced with pre-defined parameters. Deep reinforcement learning models can be used for parameter optimization. In this article we argue that an Actor-Critic based model can be used to optimize the parameters of readability formulas. Furthermore, a selection model is proposed for choosing the most suitable formula to assess the readability of the input text. English and Persian data sets are used for both training and testing. The experimental results of the parameter optimization model show that, on average, the F-score of the model for English increases from 24.7% in the baseline to 38.8%, and for Persian from 23.5% to 47.7%. The proposed algorithm selection model further improves the parameter optimization model to 65.5% F-score for both English and Persian.
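As a rough illustration of the idea, the sketch below tunes the three coefficients of a Flesch-shaped formula with a Gaussian policy and a running reward baseline — a REINFORCE-with-baseline simplification, not the paper's actor-critic architecture; the features, targets, and hyperparameters are all toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy texts as (avg sentence length, avg syllables per word); target scores
# follow the classic Flesch Reading Ease bands. All values are invented.
feats = np.array([[8.0, 1.2], [15.0, 1.6], [25.0, 2.1]])
targets = np.array([97.2, 56.3, 3.8])

def formula(f, p):
    # p = (base, sentence-length weight, syllable weight), Flesch-shaped
    return p[0] - p[1] * f[0] - p[2] * f[1]

def reward(p):
    return -np.mean((np.array([formula(f, p) for f in feats]) - targets) ** 2)

mu = np.array([180.0, 1.0, 60.0])   # actor: mean of a Gaussian policy
sigma = np.array([5.0, 0.2, 5.0])   # fixed exploration noise
baseline, lr = reward(mu), 2e-5     # "critic": a running reward baseline
print("initial fit (MSE):", round(-reward(mu), 1))

for _ in range(5000):
    p = rng.normal(mu, sigma)        # sample candidate parameters (an action)
    r = reward(p)
    advantage = r - baseline         # baseline reduces gradient variance
    mu += lr * advantage * (p - mu)  # policy-gradient step, preconditioned by sigma^2
    baseline += 0.05 * (r - baseline)

print("tuned parameters:", mu.round(2), "fit (MSE):", round(-reward(mu), 1))
```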

Language and Linguistics, 2023
News discourse analysis deals with the analysis of the language of news. Because news formatting involves the two hidden features of selection and prominence in the communicative representation of news, the inverted pyramid of news is used to grade the importance of the discourse parts of a news item. Although it is desirable to follow the structure of the inverted pyramid, sometimes this structure may change. In this article, we analyse the discourse of Persian news websites with the help of statistical analysis. To reach this goal, data science can be used: this inter-discipline deals with data analysis from a scientific perspective, finding implicit concepts in the data and extracting knowledge from it. In the framework of data science, we examined a Persian news corpus and studied the semantic correlation between the news title and the news content based on the structure of the news inverted pyramid. To this end, using a crawling method, a relatively large news corpus with a volume of 14 billion words was obtained from 24 news websites. After pre-processing and normalizing the corpus, within the framework of distributional semantics, vectors of the news title and content were created with the Word2Vec tool to obtain a vector representation of each news item. After segmenting the news content into three parts (lead, body, and further explanation of the lead) according to the inverted pyramid, the Pearson correlation coefficient was used to calculate the correlation between the title and each part of the news. Although the Pearson correlation coefficient was positive for a large number of news items, a zero value, i.e. no correlation, was found for some items. On average, the correlation between the headline and the news lead and body was higher than the correlation between the headline and the lead development. This research can be used as a method to carefully select the title and content and to filter the news according to the inverted pyramid structure.
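A small sketch of the measurement step, assuming whitespace tokenization as a stand-in for a proper Persian tokenizer and a toy one-item corpus:

```python
import numpy as np
from gensim.models import Word2Vec
from scipy.stats import pearsonr

# One toy news item split according to the inverted pyramid.
news = [{
    "title": "flood warning issued for northern provinces",
    "lead": "authorities issued a flood warning for northern provinces today",
    "body": "heavy rain is expected to continue for two more days",
    "elaboration": "last year similar storms damaged roads and farms",
}]

sentences = [part.split() for item in news for part in item.values()]
w2v = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)

def avg_vector(text):
    # a news part is represented by the average of its word vectors
    return np.mean([w2v.wv[t] for t in text.split() if t in w2v.wv], axis=0)

for item in news:
    title_vec = avg_vector(item["title"])
    for part in ("lead", "body", "elaboration"):
        r, _ = pearsonr(title_vec, avg_vector(item[part]))
        print(part, round(float(r), 3))
```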

Dialectology studies a dialect scientifically along with its geographical distribution. Each dialect is a language, and studying a dialect requires various linguistic analyses. This property makes the study of a language rather long in terms of time. Collecting dialect data is very time-consuming and requires a lot of effort. As a result, the data needs to be developed in such a way that it is reusable in further studies. Raw data is not very usable in dialectology; linguistic analyses must be added to the data in the framework of structural linguistic analysis. Since there is a huge amount of dialect data to be used in projects such as developing a dialect atlas for a country, along with the linguistic analyses added to the data, organizing the data is necessary. Using a computer as a research tool requires the data to be prepared in a specific structure. The main contribution of the current paper is proposing a standard for organizing dialect data and information. This standard contains the dialect data, its relevant meta-data, and the linguistic information related to the analysis of this data. The meta-data and linguistic information are organized in an XML tree structure. This data structure is highly portable and can easily be read into a database.
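A hypothetical fragment of such a structure, built with Python's standard XML library; all element names are illustrative assumptions, since the paper's actual schema is not reproduced here — only the three layers it describes:

```python
import xml.etree.ElementTree as ET

entry = ET.Element("dialectEntry", id="0001")

meta = ET.SubElement(entry, "metadata")          # layer 1: meta-data
ET.SubElement(meta, "location").text = "Kerman"
ET.SubElement(meta, "speakerAge").text = "63"

data = ET.SubElement(entry, "data")              # layer 2: the dialect data
ET.SubElement(data, "form").text = "transcribed dialect form"
ET.SubElement(data, "gloss").text = "standard Persian equivalent"

analysis = ET.SubElement(entry, "analysis")      # layer 3: linguistic analysis
ET.SubElement(analysis, "pos").text = "NOUN"
ET.SubElement(analysis, "phonology").text = "IPA transcription"

ET.indent(entry)                                 # Python 3.9+
print(ET.tostring(entry, encoding="unicode"))
```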

Festschrift for Professor Mahmood Bijankhan, 2022
Parsing is a preliminary step towards natural language understanding. A statistical parser, which provides the syntactic analysis of a sentence automatically, needs a set of annotated data, called a treebank, to learn the grammar of the target language. Providing this set of data is a very time-consuming and tedious task. Active learning, one of the supervised machine learning approaches, is a great help in reducing the human effort for data annotation. This method minimizes the number of required annotated sentences by selecting informative samples from a data pool and handing them to an oracle to be annotated manually. In this method of data annotation, the focus is only on hard sentences for which the parser does not provide a relatively satisfying analysis. In this paper, we propose a novel method for selecting informative samples in active learning. In this sampling method, called tree similarity, the similarity between each pair of sentences in the data pool and the training data is calculated. The sentences which have minimum similarity with the training data are selected as hard samples for manual annotation. To calculate the tree similarity between two sentences, a vector of the words, part-of-speech tags, and syntactic sub-constructions of each sentence is built. Our experiments on parsing the Persian language show that the proposed approach outperforms state-of-the-art methods such as entropy-based uncertainty sampling.
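A compact sketch of the selection step, using bag-of-feature count vectors and cosine similarity; the POS-tag bigrams are a toy stand-in for the paper's syntactic sub-constructions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each sentence is represented by its words, POS tags, and -- as a toy
# stand-in for syntactic sub-constructions -- POS-tag bigrams.
def features(tokens, tags):
    subcons = [a + "_" + b for a, b in zip(tags, tags[1:])]
    return " ".join(tokens + tags + subcons)

train_docs = [features(["he", "reads", "books"], ["PRON", "VERB", "NOUN"])]
pool_docs = [features(["she", "writes", "letters"], ["PRON", "VERB", "NOUN"]),
             features(["reading", "is", "hard"], ["NOUN", "AUX", "ADJ"])]

vec = CountVectorizer(token_pattern=r"\S+").fit(train_docs + pool_docs)
sim = cosine_similarity(vec.transform(pool_docs), vec.transform(train_docs))

# Pool sentences least similar to the training data are the "hard" samples.
hardness = 1 - sim.max(axis=1)
print("annotate first:", np.argsort(-hardness))   # -> [1 0]
```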

International Journal of Information Science and Management, 2023
This paper provides a comparative analysis of cross-lingual word embedding by studying the impact of different variables on the quality of the embedding models within the distributional semantics framework. Distributional semantics is a method for the semantic representation of words, phrases, sentences, and documents. This method aims at capturing as much information as possible from the contextual information in a vector space. Early studies in this domain focused on monolingual word embedding. Further progress used cross-lingual data to capture the contextual semantic information across different languages. The main contribution of this research is a comparative study to find out the impact of the learning methods, supervised and unsupervised, in training and post-training approaches in different embedding algorithms, on capturing semantic properties of the words in cross-lingual embedding models, to be applicable in tasks that deal with multiple languages, such as question retrieval. To this end, we study the cross-lingual embedding models created by the BilBOWA, VecMap, and MUSE embedding algorithms, along with the variables that impact the embedding models' quality, namely the size of the training data and the window size of the local context. In our study, we use the unsupervised monolingual Word2Vec embedding model as the baseline and evaluate the quality of the embeddings on three data sets: the Google analogy data set and monolingual and cross-lingual word similarity lists. We further investigate the impact of the embedding models in the question retrieval task.
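For intuition, the sketch below shows the supervised core shared by VecMap/MUSE-style mapping methods — an orthogonal Procrustes fit on a seed dictionary; the random vectors and the dictionary are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins: 100 source-language and target-language word vectors plus a
# hypothetical seed dictionary pairing the first 20 rows as translations.
X_src = rng.normal(size=(100, 50))
X_tgt = rng.normal(size=(100, 50))
seed = [(i, i) for i in range(20)]

X = X_src[[s for s, _ in seed]]
Y = X_tgt[[t for _, t in seed]]

# Orthogonal Procrustes: W = argmin_W ||XW - Y||_F with W^T W = I,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

mapped = X_src @ W   # source vectors mapped into the target space
# Cross-lingual retrieval: nearest target neighbour by cosine similarity.
q = mapped[0] / np.linalg.norm(mapped[0])
T = X_tgt / np.linalg.norm(X_tgt, axis=1, keepdims=True)
print("nearest target index:", int(np.argmax(T @ q)))
```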
* As observed, pronouns are not tagged due to a bug in the SLSP tool. We have done new experiments with the debugged tool for two targets, as discussed in Section 6.5.

Enriching contextualized semantic representation with textual information transmission for COVID-19 fake news detection: A study on English and Persian
Digital Scholarship in the Humanities
The COVID-19 pandemic created an infodemic situation, confronting people in society with a massive amount of information through access to social media, such as Twitter and Instagram. These platforms have made information circulation easy and paved the ground for mixing information and misinformation. One solution to prevent an infodemic situation is avoiding false information distribution and filtering the fake news to reduce the negative impact of such news on society. This article aims at studying the properties of fake news in English and Persian using the textual information transmitted through language in the news. To this end, the properties present in a text based on information theory, stylometric information from raw texts, readability of the texts, and linguistic information, such as phonology, syntax, and morphology, are studied. In this study, we use the XLM-RoBERTa representation with a convolutional neural network classifier as the basic model to detect English and …
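One of the simplest information-theoretic features of this kind is character-level Shannon entropy; a self-contained sketch (the example sentences are invented):

```python
import math
from collections import Counter

# Character-level Shannon entropy: one simple proxy for how much
# "information" a text transmits.
def char_entropy(text):
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

real = "The ministry of health published the vaccination statistics today."
fake = "SHOCKING!!! they HID the TRUTH about the vaccine!!!"
print(round(char_entropy(real), 3), round(char_entropy(fake), 3))
```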

Maschinelles Lernen für die Annotation von Persischen Daten (Machine Learning for the Annotation of Persian Data)
Parsing is a step towards understanding a natural language by finding out about the words and their grammatical relations in a sentence. Statistical parsers require a set of annotated data, called a treebank, to learn the grammar of a language and apply the learnt model to new, unseen data. This set of annotated data is not available for all languages, and its development is very time-consuming, tedious, and expensive. In this dissertation, we propose a method for treebanking from scratch using machine learning methods. We first propose a bootstrapping approach to initialize the data annotation process, aiming to reduce human intervention in annotating the data. After developing a small data set, we use this data to train a statistical parser. This small data set suffers from sparseness of data at the lexical and syntactic construction levels; therefore, a parser trained with this amount of data might have low performance in a real application. To resolve the data sparsity problem a…

Alzahra University, 2022
One subfield of assessment of language proficiency is predicting language proficiency level.
This research aims at proposing a computational linguistic model to predict language proficiency level and to explore the general properties of the levels. To this end, a corpus is developed from Persian learners' textbooks, and statistical and linguistic features are extracted from this text corpus to train three classifiers as learners. The performance of the models varies based on the learning algorithm and the feature set(s) used for training. For evaluating the models, four standard metrics, namely accuracy, precision, recall, and F-measure, were used.
Based on the results, the model created by the Random Forest classifier performed the best when statistical features extracted from raw text are used, while the Support Vector Machine classifier performed the best when using linguistic features extracted from the automatically annotated corpus. The results show that enriching the model and providing various kinds of information do not guarantee that a classifier (learner) performs the best.
To discover the latent teaching methodology of the textbooks, the general performance of the classifiers with respect to the language level and the linguistic knowledge used for creating the model is studied. Based on the obtained results, the number of extracted features plays an important role in training a classifier. Furthermore, the classifiers perform best, on average, when the linguistic knowledge is extended from syntactic patterns at proficiency level A (beginner) to all linguistic information at levels B (intermediate) and C (advanced).
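A skeletal version of the comparison, with random toy features standing in for the statistical and linguistic feature sets (so the scores are near chance; only the pipeline shape is meaningful):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-ins: 5 "statistical" features from raw text (e.g. sentence length,
# type/token ratio) plus 10 extra "linguistic" features from an annotated
# corpus (e.g. POS-tag proportions). Labels 0/1/2 mimic levels A/B/C.
X_stat = rng.normal(size=(90, 5))
X_ling = np.hstack([X_stat, rng.normal(size=(90, 10))])
y = np.repeat([0, 1, 2], 30)

for name, clf, X in [
    ("Random Forest / statistical", RandomForestClassifier(random_state=0), X_stat),
    ("SVM / linguistic", SVC(), X_ling),
]:
    print(name, round(cross_val_score(clf, X, y, cv=3).mean(), 3))
```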

Most of the reliable language resources are developed via human supervision. Developing supervised annotated data is hard and tedious, and it is very time-consuming when done totally manually; as a result, various types of annotated data, including treebanks, are not available for many languages. Considering that a portion of the language is regular, we can define regular expressions as grammar rules to recognize the strings which match them, and reduce the human effort to annotate further unseen data. In this paper, we propose an incremental bootstrapping approach via extracting grammar rules when no treebank is available in the first step. Since Persian suffers from a lack of available data sources, we have applied our method to develop a treebank for this language. Our experiment shows that this approach significantly decreases the amount of manual effort in the annotation process while enlarging the treebank.
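A toy illustration of the idea, treating grammar rules as regular expressions over POS-tag sequences and rewriting to a fixed point; the tag set and rules are invented for the example:

```python
import re

# Hypothetical rules: the regular fragment of the grammar is written as
# regexes over POS-tag strings and used to pre-annotate phrases as chunks.
rules = [
    ("NP", re.compile(r"\b(DET )?(ADJ )*N( PP)?\b")),
    ("PP", re.compile(r"\bP NP\b")),
]

def annotate(tags):
    changed = True
    while changed:                      # rewrite until a fixed point
        changed = False
        for label, rx in rules:
            new = rx.sub(label, tags, count=1)
            if new != tags:
                tags, changed = new, True
    return tags

print(annotate("DET ADJ N"))   # -> NP
print(annotate("P DET N"))     # -> PP (via a nested NP)
```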

Deep transfer learning for COVID-19 fake news detection in Persian
Expert Systems, 2022
The spread of fake news on social media has increased dramatically in recent years; hence, fake news detection systems have received researchers' attention globally. During the COVID-19 outbreak in 2019 and the worldwide epidemic, the importance of this issue became more apparent. Due to the importance of the issue, a large number of researchers have begun to collect English datasets and to study COVID-19 fake news detection. However, there are a large number of low-resource languages, including Persian, for which accurate tools for automatic COVID-19 fake news detection cannot be developed due to the lack of annotated data for the task. In this article, we aim to develop a corpus for Persian in the domain of COVID-19 in which the fake news is annotated, and to provide a model for detecting Persian COVID-19 fake news. With the impressive advancement of multilingual pre-trained language models, cross-lingual transfer learning can be proposed to improve the generalization of models trained with low-resource language datasets. Accordingly, we use the state-of-the-art deep cross-lingual contextualized language model, XLM-RoBERTa, and parallel convolutional neural networks to detect Persian COVID-19 fake news. Moreover, we use the idea of knowledge transfer across domains to improve the results by using both the English COVID-19 dataset and a general-domain Persian fake news dataset. The combination of cross-lingual and cross-domain transfer learning outperformed the other models and beat the baseline significantly, by 2.39%.
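A minimal sketch of such an architecture in PyTorch, with parallel convolutions of different kernel widths over XLM-RoBERTa token embeddings; the filter counts, kernel sizes, and pooling are illustrative choices, not the paper's settings:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Parallel 1-D convolutions over contextual token embeddings, max-pooled
# and concatenated, then a linear fake/real classification head.
class ParallelCNNClassifier(nn.Module):
    def __init__(self, hidden=768, n_filters=100, kernels=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernels)
        self.fc = nn.Linear(n_filters * len(kernels), 2)

    def forward(self, token_embeddings):        # (batch, seq, hidden)
        x = token_embeddings.transpose(1, 2)    # Conv1d wants (batch, ch, seq)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = AutoModel.from_pretrained("xlm-roberta-base")
batch = tok(["این خبر جعلی است"], return_tensors="pt")
with torch.no_grad():
    emb = enc(**batch).last_hidden_state
print(ParallelCNNClassifier()(emb).shape)       # torch.Size([1, 2])
```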
A Tentative Method of Tokenizing Persian Corpus based on Language Modelling
Language and Linguistics, May 22, 2018

Iranian Journal of Information Processing & Management, 2019
A word is the smallest unit in the language that has 'form' and 'meaning'. A word might have more than one meaning, and its exact meaning is determined according to the context in which it appears. Collecting all of a word's senses manually is a tedious and time-consuming task. Moreover, it is possible that word meanings change over time, such that the meaning of an existing word becomes unusable or a new meaning is added to the word. Computational methods are among the approaches used for identifying word senses with respect to their linguistic contexts. In this paper, we propose an algorithm to identify the senses of Persian words automatically, without human supervision. To reach this goal, we utilize the word embedding method in a vector space model. To build word vectors, we use an algorithm based on the neural network approach to gather the context information of the words in the vectors. In the proposed model of this r…
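A compact sketch of sense induction by clustering context vectors, using toy English sentences in place of the Persian data; the clustering granularity (two senses) is an assumption, and on such tiny data the clusters are only illustrative:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Contexts of an ambiguous word are embedded (average of neighbour vectors)
# and clustered; each cluster is taken as one induced sense.
contexts = [
    "deposit money at the bank branch".split(),
    "the bank approved my loan".split(),
    "we sat on the river bank".split(),
    "fishing from the muddy bank of the river".split(),
]
w2v = Word2Vec(contexts, vector_size=25, window=3, min_count=1, seed=3)

def context_vector(tokens, target="bank"):
    neighbours = [w2v.wv[t] for t in tokens if t != target]
    return np.mean(neighbours, axis=0)

X = np.array([context_vector(c) for c in contexts])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # each label = one induced sense of "bank"
```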
Persian is one of the Indo-European languages which has borrowed its script from Arabic, a member of the Semitic language family. Since the Persian and Arabic scripts are so similar, problems arise when we want to process an electronic text. In this paper, some of the common problems faced experimentally in developing a corpus for Persian are discussed. The sources of the problems are the Persian script itself; its mixture with the Arabic script; Persian orthography; the typists' typing styles; mixing Persian code pages with Arabic in operating systems; and linguistic style and creativity in the language.
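Two of the best-known script mix-ups — Arabic yeh vs. Farsi yeh and Arabic kaf vs. keheh — can be normalized with a simple code-point mapping; a minimal sketch:

```python
# Map Arabic code points that overlap visually with Persian letters onto
# the Persian forms; these two substitutions are the classic cases.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

def normalize(text: str) -> str:
    return text.translate(ARABIC_TO_PERSIAN)

mixed = "كتاب علمي"          # typed with Arabic kaf and yeh
print(normalize(mixed))       # -> "کتاب علمی"
```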