

Training Cross-Lingual embeddings for Setswana and Sepedi

2021, Journal of the Digital Humanities Association of Southern Africa (DHASA)

https://doi.org/10.55492/DHASA.V3I03.3822

Abstract

How can we transfer semantic information between two low-resourced African languages, where one language has more resources than the other? African languages still lag behind in the advances of Natural Language Processing techniques, one reason being the lack of representative data, and a technique that can transfer information between languages can help mitigate the data problem. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create cross-lingual embeddings for Setswana-Sepedi, since cross-lingual embeddings can be used as a method of transferring semantic information from rich to low-resourced languages. Word embeddings are word vectors that represent words as continuous floating-point numbers, where semantically similar words are mapped to nearby points in n-dimensional space. The idea of word embeddings is based on the distributional hypothesis, which states that semantically similar words are distributed in similar contexts (Harris, 1954). Cross-lingual embeddings leverage monolingual embeddings by learning a shared vector space for two separately trained monolingual vectors such that words with similar meaning are represented by similar vectors. In this paper, we investigate cross-lingual embeddings for the Setswana-Sepedi monolingual word vectors. We use the unsupervised cross-lingual embeddings in VecMap to train Setswana-Sepedi cross-lingual word embeddings. We evaluate the quality of the Setswana-Sepedi cross-lingual word representation using a semantic evaluation task. For the semantic similarity task, we translated the WordSim and SimLex datasets into Setswana and Sepedi, and we release this dataset as part of this work for other researchers. We evaluate the intrinsic quality of the embeddings to determine whether there is an improvement in the semantic representation of the word embeddings.

Makgatho, Mack (Dept. of Computer Science, University of Pretoria), [email protected]
Marivate, Vukosi (Dept. of Computer Science, University of Pretoria), [email protected]
Sefara, Tshephisho (Council for Scientific and Industrial Research), [email protected]
Wagner, Valencia (Sol Plaatje University), [email protected]

arXiv:2111.06230v1 [cs.CL] 11 Nov 2021

Keywords: cross-lingual embeddings, word embeddings, intrinsic evaluation

1 Introduction

Many African languages have insufficient language resources (data, tools, people) (Abbott & Martinus 2019, Martinus & Abbott 2019, Nekoto et al. 2020, Sefara et al. 2021) and fall into the classification of low-resource languages (Ranathunga et al. 2021) in the Natural Language Processing (NLP) field. This lack of resources makes it harder to capitalise on recent advances in many NLP downstream tasks such as Neural Machine Translation (Cho et al. 2014), Large Language Models (Devlin et al. 2018, Howard & Ruder 2018), and Q&A systems (Kwiatkowski et al. 2019). There are downstream approaches that can deal with some of these challenges, such as Transfer Learning (Ruder et al. 2019), Data Augmentation (Marivate & Sefara 2020a), and Multilingual Models (Hedderich et al. 2020). Additionally, the lack of research attention to existing NLP techniques results in difficulties finding a benchmark (Abbott & Martinus 2019). In this work, we focus on word representations through word embeddings and how we can leverage one language to assist in the representation of another related language. These embeddings can then be used to develop tools for other downstream tasks.

Word embeddings are a mathematical technique to learn general language vector representations from a large amount of unlabelled text using co-occurrence statistics. In recent years, monolingual word embedding techniques have increasingly become an important resource in NLP. Word embeddings are widely used in NLP problems such as sentiment analysis (Socher et al. 2013), named-entity recognition (Guo et al. 2014), part-of-speech tagging, and document retrieval. Word2Vec is a vector training model proposed by Mikolov et al. (2013). Word2Vec produces a low-dimensional real-valued vector representing the meaning of a word. The word vector represents grammatical and semantic properties, which results in words with similar semantic relations being close to each other. The word vector representation method incorporates the semantic relationship between words, which is not possible through representations such as Bag-of-Words or TF-IDF. Word embeddings are better than both methods because they map all the words in a language into a vector space of a given dimension; the words are converted into vectors that allow multiple linear operations and have the property of preserving analogies (Mikolov et al. 2013, Pennington et al. 2014).

Cross-lingual word embeddings have been receiving more and more attention from the NLP community, mainly because they provide a path to effectively align two disjoint monolingual embeddings with no bilingual dictionary for unsupervised techniques, or no more than a small bilingual dictionary for supervised techniques (Lample et al. 2018, Artetxe et al. 2020). Cross-lingual techniques also enable knowledge transfer between languages with rich resources and languages with low resources. For languages lacking a bilingual parallel corpus with other languages, cross-lingual embeddings can be utilised to train high-quality cross-lingual embeddings (Lample et al. 2018). This can aid in accelerating the progress of applying NLP to low-resourced languages. Artetxe et al. (2018) created the cross-lingual unsupervised or supervised word embedding approach (the VecMap library) for training cross-lingual word embedding models. The approach can be used to construct cross-lingual word vectors with or without a bilingual dictionary.

The majority of South African languages lack a bilingual parallel corpus with other languages. In this work, we aim to investigate how cross-lingual embeddings could be used to improve the state of one or both languages. We used data (corpora) from different domains to train Word2Vec and fastText (Bojanowski et al. 2016) monolingual embeddings. When using VecMap, the two embeddings are aligned; VecMap requires two monolingual word vectors, one from the source and one from the target language (Artetxe et al. 2018). To evaluate the effectiveness of the cross-lingual embeddings for Setswana and Sepedi, we use intrinsic evaluation (Bakarov 2018) through Setswana and Sepedi versions of WordSim (Finkelstein et al. 2001) and SimLex (Hill et al. 2015). This follows an approach that has been used for Yoruba and Twi (Alabi et al. 2019). We also release the dataset for this benchmark of human semantic similarity.

This paper is structured as follows: the next section reviews related work on cross-lingual word vectors, followed by data collection in Section 3. Section 4 discusses the methodology followed to train cross-lingual word vectors using VecMap. The evaluation of the word vectors is discussed in Section 5, Section 5.1 explains the results, Section 6 discusses the findings, and finally, conclusions and future work can be found in Section 7.
2 Background and Related Work

Cross-lingual word embeddings (CLWEs) are becoming popular in NLP for two reasons: they can transfer knowledge from rich-resourced languages to low-resourced ones, and they can infer the semantics of words in a multi-language environment. Conneau et al. (2018) show that word embedding spaces can be aligned without any cross-lingual supervision. The alignment is based solely on unaligned datasets of each language. Using adversarial training, they were able to initialise a linear mapping between a source and a target space, which they use to create a synthetic parallel dictionary. First, they propose a simple criterion that is used as an unsupervised validation metric. Second, they propose the similarity measure cross-domain similarity local scaling (CSLS), which mitigates the hubness problem and increases word translation accuracy. The hubness problem is defined by Dinu et al. (2015) as follows:

"neighbourhoods of the mapped elements are strongly polluted by hubs, vectors that tend to be near a high proportion of items, pushing their correct labels down the neighbour list."
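CSLS counteracts this by penalising similarities that involve such hub vectors. As a minimal illustration (not part of the original paper), the sketch below scores every source-target pair this way; unit-length embedding matrices and k = 10 neighbours are illustrative assumptions.

```python
import numpy as np

def csls_scores(src_vecs, trg_vecs, k=10):
    """Cross-domain similarity local scaling (CSLS), after Conneau et al. (2018).

    src_vecs, trg_vecs: unit-length embedding matrices (n_src x d, n_trg x d).
    Each cosine similarity is penalised by the mean similarity of the source
    word to its k nearest target neighbours and of the target word to its k
    nearest source neighbours, which pushes hub vectors down the ranking.
    """
    sims = src_vecs @ trg_vecs.T                          # plain cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)    # hub penalty per source word
    r_trg = np.sort(sims, axis=0)[-k:, :].mean(axis=0)    # hub penalty per target word
    return 2 * sims - r_src[:, None] - r_trg[None, :]
```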
In the work done by Adams et al. (2017), the research looked at applying CLWEs to Yongning Na, a Sino-Tibetan language. The research focused on determining whether the quality of CLWEs depends on having large amounts of data in multiple languages and whether initialising the parameters of neural network language models (NNLMs) can improve language modelling in a low-resourced context. The research scaled down the available monolingual data of the target language to about 1000 sentences. The intrinsic quality of the embeddings was assessed by considering correlation with human judgement on the WordSim353 (Finkelstein et al. 2001) test set. They went further and performed language modelling experiments by initialising the parameters of a long short-term memory (LSTM) network (Hochreiter & Schmidhuber 1997) through training across different language pairs. The research showed that CLWEs are resilient even when the target language training data is scaled down and that initialisation of NNLM parameters leads to good performance.

Artetxe & Schwenk (2019) introduced an architecture that can be used to learn multilingual sentence representations for more than 90 languages belonging to 30 different families. The research used a single BiLSTM encoder with a shared Byte Pair Encoding (BPE) vocabulary, coupled with an auxiliary decoder, and trained on parallel corpora. They learn a classifier using English annotated data only and transfer it to any of the languages without modification. The research mainly focused on vector representations of sentences that are general with respect to the input language and the NLP task.

Alabi et al. (2019) worked on massive versus curated embeddings for low-resourced languages, for the case of Yorùbá and Twi. The authors compare two types of word embeddings obtained from curated corpora and language-dependent processing, and go further to collect high-quality and noisy data for the two languages. They quantify improvements that are based on the quality of the data and not only on the amount of data. In their experiments, they use different architectures to learn word representations from both characters and surface forms. They evaluate multilingual BERT on a downstream task, specifically named entity recognition, and on the WordSim-353 word pairs dataset.

Feng et al. (2018) investigate a cross-lingual knowledge transfer technique to improve the semantic representation of low-resourced languages and to improve low-resource named-entity recognition. In their research, neural networks are used to transfer knowledge from a high-resource language using bilingual lexicons to improve low-resource word representation. They automatically learn semantic projections using a lexicon extension strategy that is designed to address the out-of-lexicon problem. Finally, they regard word-level entity type distribution features as external, language-independent knowledge and incorporate them into their neural architecture. The experiment is done on two low-resource languages (Dutch and Spanish) to demonstrate the effectiveness of these additional semantic representations.

Banerjee et al. (2021) show that initialising the embedding layer of Unsupervised Neural Machine Translation (UNMT) models with cross-lingual embeddings leads to significant improvements in BLEU score. The authors show that freezing the embedding layer weights leads to better gains compared to updating the embedding layer weights during training. They experimented using a Denoising Autoencoder (DAE) and Masked Sequence to Sequence (MASS) pre-training for three different unrelated language pairs (English-Hindi, English-Bengali, and English-Gujarati). The analysis shows the importance of using cross-lingual embeddings as compared to other techniques.

The literature shows that there is a substantial amount of work on cross-lingual transfer and empirical proof that the method improves the performance of models. The literature does not rely solely on intrinsic evaluation; the solutions are also applied to downstream tasks. In the next section, we detail the data used for conducting the experiments.
Fi- to perform language modelling experiments by ini- nally, they regard word-level entity type distribu- tialising the parameters for long short-term mem- tion features as an external language independent ory (LSTM) (Hochreiter & Schmidhuber 1997) by knowledge and incorporate them into their neural training across different language pairs. The re- architecture. The experiment is done on two low search showed that CLWEs are resilient even when resource languages (Dutch and Spanish) to demon- target language training data is scaled-down and strate the effectiveness of these additional semantic that initialisation of NMLM parameters leads to representations. good performance. Banerjee et al. (2021) show that initialising the em- Artetxe & Schwenk (2019) introduced an architec- bedding layer of Unsupervised Neural Machine ture that can be used to learn multilingual sentence Translation (UNMT) models with cross-lingual representations for more than 90 languages. The embeddings shows significant improvements in languages belonged to 30 different families. The re- BLEU score. Authors show that freezing the em- search used a single BiLSTM encoder with a shared bedding layer weights lead to better gains com- Byte Pair Encoding (BPE) vocabulary coupled with pared to updating the embedding layer weights an auxiliary decoder and trained on parallel corpora. during training. They experimented using De- They learn a classifier using English annotated data noising Autoencoder (DAE) and Masked Sequence only and transfer it to any language without modifi- to Sequence (MASS) for three different unrelated 3 DHASA2021 language pairs (for English-Hindi, English-Bengali, Table 1: Corpus size for the Setswana and Sepedi and English-Gujarati). The analysis shows the im- Datasets portance of using cross-lingual embedding as com- Sepedi Setswana pared to other techniques. Number of tokens: 2133972 3000682 The literature shows that there is a substantial Unique words: 93461 107606 amount of work done on cross-lingual transfer and empirical proof that the method improves the per- formance of models. The literature does not relay solely on intrinsic evaluation but the solutions are all words to lowercase, removing brackets, digits, applied to some downstream tasks. In the next sec- punctuations, and white spaces. tion, we detail the data used for conducting experi- In this section we dealt with how we collected the ments. data used to training our monolingual embeddings for both languages and what approach we took 3 Data collection to pre-process the data before training the mod- Training data is very important for implementing els. In the next section we discuss the approach powerful and accurate models, and clean training taken to train the monolingual embeddings and data can make a difference between a good and great how VecMap was used to training the cross-lingual model. The data needs to be very imperative be- embeddings. cause the quality of the alignment depends on the quality of the monolingual embeddings, i.e. data used to create the initial monolingual embeddings before mapping. 4 Training monolingual and cross- We use data collected from different domains for lingual embeddings (VecMap) training word vectors: • JW300 bible (Agić & Vulić 2019): A biblical- In this section, we present the methods (frame- domain data set containing parallel corpus for works) used to train monolingual and cross-lingual low-resourced languages. embeddings. 
In this section, we dealt with how we collected the data used to train our monolingual embeddings for both languages and the approach we took to preprocess the data before training the models. In the next section, we discuss the approach taken to train the monolingual embeddings and how VecMap was used to train the cross-lingual embeddings.

4 Training monolingual and cross-lingual embeddings (VecMap)

In this section, we present the methods (frameworks) used to train monolingual and cross-lingual embeddings. We describe the parameters used to train word2Vec and fastText embeddings. We also look into VecMap, the framework that we used to align the monolingual embeddings.

CLWEs have proved to perform very well for low-resourced languages. The main idea is to do a cross-lingual transfer from the source language to the target language, such that we have a single representation for a pair of languages where semantically similar words are closer to one another. In order to use VecMap, two monolingual embeddings are required, so we train fastText and word2Vec vectors. We use the parameters for fastText and word2Vec shown in Table 2. The definitions of the parameters are as follows: skipGram, the training method; dim, the size of the word vectors; minCount, the minimal number of word occurrences; ws, the size of the context window; and epoch, the number of epochs or iterations.

Table 2: Parameters for FastText and Word2Vec

Parameter  Value
skip-gram  true
dim        300
minCount   1
ws         4
epoch      100

4.1 Word2Vec

The word2Vec (Mikolov et al. 2013) algorithm is a two-layer neural network that vectorises words to process text. The algorithm takes a text corpus as input and returns feature vectors that represent the words in that corpus as a set of vectors. Word2Vec trains words against neighbouring words based on a context window size. It trains the words using one of two methods: skip-gram or continuous bag of words (CBOW); skip-gram uses a word to predict a target context, and CBOW uses the context to predict a target word. The experiment uses skip-gram to train the monolingual embeddings. We use word vectors that were trained using Word2Vec; these correspond to monolingual embeddings of dimension 300 trained on the Sepedi and Setswana corpora.

4.2 FastText

FastText (Bojanowski et al. 2016) is a supervised prediction-based technique based on the word2Vec family of algorithms (Mikolov et al. 2013). It predicts tags through context and represents each word as a bag of character n-grams, instead of learning vectors for words directly. The fastText model has three layers: an input layer, a hidden layer, and an output layer. The input is a number of words and their n-gram features, and these features are used to represent a single document. The hidden layer is the superimposed average of multiple feature vectors. The hidden layer solves the maximum likelihood function, then constructs a Huffman tree according to the weights and model parameters of each category, and uses the Huffman tree as the output.
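The paper does not state which toolkit was used to train the vectors. As a purely illustrative sketch, the Table 2 settings map onto gensim's Word2Vec and FastText implementations as follows; the gensim 4.x argument names and the file paths are assumptions.

```python
from gensim.models import FastText, Word2Vec

# Table 2 settings expressed as gensim keyword arguments.
params = dict(
    sg=1,             # skip-gram training method
    vector_size=300,  # dim
    min_count=1,      # minCount
    window=4,         # ws
    epochs=100,       # epoch
)

# One whitespace-tokenised sentence per line, preprocessed as in Section 3.
corpus = "nchlt_setswana.clean.txt"   # placeholder path
w2v = Word2Vec(corpus_file=corpus, **params)
ft = FastText(corpus_file=corpus, **params)

# VecMap consumes plain-text word2vec format, so export the word vectors.
w2v.wv.save_word2vec_format("setswana.w2v.vec")
ft.wv.save_word2vec_format("setswana.ft.vec")
```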
4.3 VecMap

VecMap (Artetxe et al. 2020) is an open-source framework, written in Python, for learning CLWEs. There are two techniques for building cross-lingual embeddings with VecMap: supervised (recommended if you have a large training dictionary) and unsupervised (recommended if you have no seed dictionary and do not want to rely on identical words). In this work, we align the word embeddings using VecMap in the fully unsupervised setting. The steps we followed to build our cross-lingual word embedding model are motivated by the authors of VecMap (Artetxe et al. 2020). The assumption is that we have a monolingual corpus for the source and target languages. The word representations are learned independently for each language (monolingual embeddings for each language) and then mapped to a common vector space.
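As an illustrative sketch of this mapping step, assuming the public VecMap repository (https://github.com/artetxem/vecmap) is available locally and using placeholder file names, the fully unsupervised mode could be invoked as follows; this is a sketch, not the authors' exact pipeline.

```python
import subprocess

# Map the two monolingual spaces into a shared space with VecMap's fully
# unsupervised mode. File names are placeholders; the mapped outputs feed
# the evaluation in Section 5.
subprocess.run(
    [
        "python3", "map_embeddings.py", "--unsupervised",
        "setswana.w2v.vec",         # source monolingual embeddings
        "sepedi.w2v.vec",           # target monolingual embeddings
        "setswana.w2v.mapped.vec",  # mapped source output
        "sepedi.w2v.mapped.vec",    # mapped target output
    ],
    check=True,
)
```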
In this section, we presented word2Vec, fastText, and VecMap. We also described the parameters used to train the word2Vec and fastText embeddings. In the next section, we present experimental results and perform some analyses.

5 Evaluation

We evaluate the quality of the Setswana and Sepedi word vector representations on two different benchmarks, SimLex and WordSim. The datasets (SimLex and WordSim) contain pairs of Setswana and Sepedi words that have been assigned similarity ratings by humans; they give a similarity score between a pair of words corresponding to their relatedness. Cosine similarity is used to obtain a score from the model in order to check how close the model score is to the human score, and we use Spearman's correlation to measure the agreement. The Spearman index measures the dependence of two variables; the correlation of two statistical variables is evaluated using a monotonic function. We manually translated the WordSim and SimLex word pairs datasets from English into Setswana and Sepedi. We are releasing the dataset of Setswana and Sepedi translated WordSim and SimLex as part of this project at https://github.com/dsfsi/embedding-eval-data, archived on Zenodo at https://zenodo.org/record/5673974. The evaluation procedure is sketched below.
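A minimal sketch of this intrinsic evaluation, assuming the translated word pairs are stored as tab-separated word1, word2, rating triples; the file names and format are illustrative, and the final multiplication by 100 simply mirrors the scale of the numbers reported in Tables 3 to 6.

```python
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

def evaluate(vec_path, pairs_path):
    """Coverage and Spearman correlation against human similarity ratings."""
    wv = KeyedVectors.load_word2vec_format(vec_path)
    human, model, total = [], [], 0
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            w1, w2, rating = line.strip().split("\t")
            total += 1
            if w1 in wv and w2 in wv:                 # only in-vocabulary pairs are scored
                human.append(float(rating))
                model.append(wv.similarity(w1, w2))   # cosine similarity
    coverage = 100 * len(human) / total
    rho, _ = spearmanr(human, model)
    return coverage, rho

coverage, rho = evaluate("setswana.ft.vec", "wordsim_setswana.tsv")
print(f"Coverage: {coverage:.2f}%  Spearman: {100 * rho:.2f}")
```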
Table 3: FastText monolingual results

Monolingual fastText   Coverage  Spearman
Sepedi (SimLex)        94.58     40.39
Sepedi (WordSim)       81.29     46.15
Setswana (SimLex)      95.22     33.23
Setswana (WordSim)     95.38     44.80

Table 4: Word2Vec monolingual results

Monolingual word2Vec   Coverage  Spearman
Sepedi (SimLex)        79.49     25.96
Sepedi (WordSim)       84.49     23.57
Setswana (SimLex)      95.32     31.52
Setswana (WordSim)     95.38     35.11

Table 5: Word2Vec cross-lingual results

Crosslingual word2Vec      Coverage  Spearman
Setswana-Sepedi (SimLex)   90.76     31.14
Setswana-Sepedi (WordSim)  68.56     40.87

Table 6: FastText cross-lingual results

Crosslingual fastText      Coverage  Spearman
Setswana-Sepedi (SimLex)   91.19     30.44
Setswana-Sepedi (WordSim)  68.84     36.33

5.1 Results

This section presents the results of the experiments conducted to show the efficiency of the proposed technique. We first present the monolingual evaluation task for word2Vec and fastText and then present the cross-lingual evaluation task for Setswana and Sepedi. The cross-lingual evaluation task is based on two embedding methods, fastText and word2Vec.

In Table 3 and Table 4, we show the Spearman's correlation for word vectors trained with fastText and word2Vec. The correlation scores measure the similarity between word vectors. The scores in Table 5 and Table 6 are obtained by taking the Setswana and Sepedi monolingual vectors and using VecMap to align the two sets of vectors in the same vector space.

Table 3 and Table 4 show the coverage and Spearman results. Coverage refers to the total number of in-vocabulary words (words that are found both in the model and in the evaluation dataset). We can see that the coverage is lower for word2Vec but a little higher for fastText (we expected the coverage for fastText to be 100 percent). The SimLex and WordSim similarity scores for the monolingual fastText embeddings in Table 3 are higher; this is expected because the coverage percentage is also very high compared to the coverage values in Table 4.

6 Discussion

The main purpose of this research is to show that it is possible to do cross-lingual transfer from the source language to the target language. In essence, we wanted to check whether cross-lingual alignment can improve the word representation of the target language. The results in Table 4 show that the Spearman's correlation value for the target language when using word2Vec is low, which is also due to the coverage percentage, while the fastText-based embeddings perform better in Table 3 and have a higher coverage percentage; as stated above, we expected 100 percent coverage. Table 5 shows that we improved the representation of words after cross-lingual alignment for the word2Vec-based embeddings. The Spearman's value has increased for both SimLex and WordSim. We expected to improve the results for the fastText embeddings, but in this case word2Vec actually yielded better results.

7 Conclusion

In this paper, VecMap was used to align Setswana and Sepedi in the same vector space. Through this work, we wanted to use a cross-lingual technique (VecMap) to enable knowledge transfer between languages with rich resources and languages with low resources. The results show that it is possible to align two monolingual embeddings to get cross-lingual embeddings. We mapped Setswana to Sepedi and used Spearman's correlation to check the quality of the representation. Interestingly, we get different results for the fastText- and word2Vec-based embeddings even though we used the same data to train the embeddings.

In future work, it would be interesting to use the cross-lingual embeddings on a downstream task like translation or sentiment analysis, specifically for low-resourced languages.

8 Acknowledgements

We would like to acknowledge ABSA for sponsoring the industry chair and its related activities for the project.
(2020), Transfer and classification: Setswana and sepedi, in ‘Pro- learning and distant supervision for multilingual ceedings of the first workshop on Resources for transformer models: A study on african lan- African Indigenous Languages’, pp. 15–20. guages, in ‘Proceedings of the 2020 Conference Martinus, L. & Abbott, J. Z. (2019), ‘A focus on Empirical Methods in Natural Language Pro- on neural machine translation for african lan- cessing (EMNLP)’, pp. 2580–2591. guages’, arXiv preprint arXiv:1906.05685 . Hill, F., Reichart, R. & Korhonen, A. (2015), Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. ‘Simlex-999: Evaluating semantic models with & Dean, J. (2013), Distributed representations of (genuine) similarity estimation’, Computational words and phrases and their compositionality, in Linguistics 41(4), 665–695. ‘Advances in neural information processing sys- Hochreiter, S. & Schmidhuber, J. (1997), ‘Long tems’, pp. 3111–3119. short-term memory’, Neural Computation Nekoto, W., Marivate, V., Matsila, T., Fasubaa, T., 9, 1735–1780. Fagbohungbe, T., Akinola, S. O., Muhammad, Howard, J. & Ruder, S. (2018), Universal language S., Kabenamualu, S. K., Osei, S., Sackey, F. et al. model fine-tuning for text classification, in ‘Pro- (2020), Participatory research for low-resourced ceedings of the 56th Annual Meeting of the Asso- machine translation: A case study in african lan- ciation for Computational Linguistics (Volume guages, in ‘Proceedings of the 2020 Conference 1: Long Papers)’, pp. 328–339. on Empirical Methods in Natural Language Pro- cessing: Findings’, pp. 2144–2160. Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, Pennington, J., Socher, R. & Manning, C. (2014), D., Polosukhin, I., Devlin, J., Lee, K. et al. Glove: Global vectors for word representation, (2019), ‘Natural questions: a benchmark for Vol. 14, pp. 1532–1543. 8 Proceedings of the International Conference of the Digital Humanities Association of Southern Africa 2021 Ranathunga, S., Lee, E.-S. A., Skenduli, M. P., Shekhar, R., Alam, M. & Kaur, R. (2021), ‘Neu- ral machine translation for low-resource lan- guages: A survey’, arXiv preprint arXiv:2106.15115 . Ruder, S., Peters, M. E., Swayamdipta, S. & Wolf, T. (2019), Transfer learning in natural language pro- cessing, in ‘Proceedings of the 2019 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Tutorials’, pp. 15–18. Sefara, T. J., Zwane, S. G., Gama, N., Sibisi, H., Senoamadi, P. N. & Marivate, V. (2021), Transformer-based machine translation for low- resourced languages embedded with language identification, in ‘2021 Conference on Informa- tion Communications Technology and Society (ICTAS)’, IEEE, pp. 127–132. Socher, R., Perelygin, A., Wu, J., Chuang, J., Man- ning, C. D., Ng, A. Y. & Potts, C. (2013), Recur- sive deep models for semantic compositionality over a sentiment treebank, in ‘Proceedings of the 2013 conference on empirical methods in natural language processing’, pp. 1631–1642. 9
