Training Cross-Lingual embeddings for Setswana and Sepedi
2021, Journal of the Digital Humanities Association of Southern Africa (DHASA)
https://doi.org/10.55492/DHASA.V3I03.3822…
Abstract
How can we transfer semantic information between two low-resourced African languages when one has more resources than the other? African languages still lag behind in advances in Natural Language Processing techniques, one reason being the lack of representative data, so a technique that can transfer information between languages helps mitigate the data problem. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create Setswana-Sepedi cross-lingual embeddings, since cross-lingual embeddings can serve as a method of transferring semantic information from richer to lower-resourced languages. Word embeddings are word vectors that represent words as continuous floating-point numbers, where semantically similar words are mapped to nearby points in n-dimensional space. Each point captures the meaning of a word, with semantically similar words having similar vector values near that point. The vectors are captured in a mann...
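The paper's own pipeline is not reproduced here, but the general idea it describes can be sketched: train monolingual vectors for each language, then learn an orthogonal mapping from a small seed dictionary, which is the Procrustes-style alignment that offline tools such as VecMap build on. The following is a minimal sketch assuming gensim and NumPy; the corpus files, hyper-parameters, and the tiny Setswana-Sepedi seed dictionary are illustrative placeholders, not the authors' actual data or settings.

```python
# Minimal sketch: train monolingual vectors, then align the two spaces with an
# orthogonal (Procrustes) mapping, the core idea behind VecMap-style methods.
# File names and the seed dictionary below are illustrative placeholders.
import numpy as np
from gensim.models import Word2Vec

# Train monolingual embeddings for each language from tokenised sentences.
tsn_sentences = [line.split() for line in open("setswana.txt", encoding="utf-8")]
nso_sentences = [line.split() for line in open("sepedi.txt", encoding="utf-8")]
tsn_model = Word2Vec(tsn_sentences, vector_size=300, window=5, min_count=5)
nso_model = Word2Vec(nso_sentences, vector_size=300, window=5, min_count=5)

# A small seed dictionary of Setswana-Sepedi translation pairs (placeholder).
seed_pairs = [("motho", "motho"), ("metsi", "meetse"), ("ngwana", "ngwana")]

# Stack the paired vectors and solve the orthogonal Procrustes problem:
# W = argmin ||XW - Y||_F subject to W being orthogonal.
X = np.vstack([tsn_model.wv[s] for s, t in seed_pairs])
Y = np.vstack([nso_model.wv[t] for s, t in seed_pairs])
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Project every Setswana vector into the Sepedi space.
tsn_in_nso_space = tsn_model.wv.vectors @ W
```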
Related papers
Word embeddings are widely used in natural language processing (NLP) tasks. Most work on word embeddings focuses on monolingual languages with large available datasets. For embeddings to be useful in a multilingual environment, as in South Africa, the training techniques have to be adjusted to cater for a) multiple languages, b) smaller datasets and c) the occurrence of code-switching. One of the biggest roadblocks is obtaining datasets that include examples of natural code-switching, since code-switching is generally avoided in written material. A solution to this problem is to use speech-recognised data. Embedding packages like Word2Vec and GloVe have default hyper-parameter settings that are usually optimised for training on large datasets and evaluation on analogy tasks. When using embeddings for problems such as text classification in our multilingual environment, the hyper-parameters have to be optimised for the specific data and task. We investigate the importance of optimisi...
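As a rough illustration of the kind of hyper-parameter adjustment discussed above, the sketch below trains gensim's Word2Vec with settings more suited to a small corpus than the large-corpus defaults; the file name and the specific values are illustrative assumptions, not the study's tuned configuration.

```python
# Sketch of adjusting Word2Vec hyper-parameters for a small, possibly
# code-switched corpus instead of the large-corpus defaults; all values
# shown here are illustrative, not tuned results.
from gensim.models import Word2Vec

small_corpus = [line.split() for line in open("speech_transcripts.txt", encoding="utf-8")]

model = Word2Vec(
    small_corpus,
    vector_size=100,   # smaller dimensionality for limited data
    window=3,          # narrower context window
    min_count=2,       # keep rarer words that the defaults would discard
    sg=1,              # skip-gram often copes better with small corpora
    epochs=30,         # more passes over the (small) data
)
```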
ArXiv, 2022
Cross-Lingual Word Embeddings (CLWEs) are a key component to transfer linguistic information learnt from higher-resource settings into lower-resource ones. Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficiency, and ability to work with minimal parallel resources. However, they crucially depend on the assumption that embedding spaces are approximately isomorphic, i.e. share a similar geometric structure, which does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language. In our work, we first pre-align the low-resource and related-language embedding spaces using offline methods to mitigate the assumption of isometry. Following this, we use joint training methods to develop CLWEs for the relat...
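Whatever method induces the shared space, the resulting CLWEs are typically used (and evaluated) by nearest-neighbour retrieval across languages. The sketch below shows only that retrieval step, using cosine similarity over NumPy arrays; the mapping matrix, vocabularies, and vectors are assumed inputs, not artefacts released with the paper.

```python
# Sketch of cross-lingual nearest-neighbour retrieval over an aligned space.
# src_vecs / tgt_vecs are (vocab_size, dim) NumPy arrays, src_vocab / tgt_vocab
# are word lists, and W is a previously learned mapping matrix; all of these
# are assumed inputs for illustration.
import numpy as np

def translate(word, src_vecs, src_vocab, tgt_vecs, tgt_vocab, W):
    """Return the target word closest (by cosine) to the mapped source word."""
    x = src_vecs[src_vocab.index(word)] @ W               # map into target space
    x = x / np.linalg.norm(x)
    tgt_norm = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    scores = tgt_norm @ x                                  # cosine similarities
    return tgt_vocab[int(np.argmax(scores))]
```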
Digital, 2021
This article presents a systematic literature review on quantifying the proximity between independently trained monolingual word embedding spaces. A search was carried out in the broader context of inducing bilingual lexicons from cross-lingual word embeddings, especially for low-resource languages. The returned articles were then classified. Cross-lingual word embeddings have drawn the attention of researchers in the field of natural language processing (NLP). Although existing methods have yielded satisfactory results for resource-rich languages and languages related to them, some researchers have pointed out that the same is not true for low-resource and distant languages. In this paper, we report on research into methods proposed to provide better representation for low-resource and distant languages in the cross-lingual word embedding space.
2020
Dense word vectors or ‘word embeddings’ which encode semantic properties of words, have now become integral to NLP tasks like Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place these embeddings for all these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements, but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual embeddings for all pairs of the aforementioned languages. To show the efficacy of our embeddings, we evalua...
Proceedings of the AAAI Conference on Artificial Intelligence
The lack of annotated data in many languages is a well-known challenge within the field of multilingual natural language processing (NLP). Therefore, many recent studies focus on zero-shot transfer learning and joint training across languages to overcome data scarcity for low-resource languages. In this work we (i) perform a comprehensive comparison of state-of-the-art multilingual word and sentence encoders on the tasks of named entity recognition (NER) and part of speech (POS) tagging; and (ii) propose a new method for creating multilingual contextualized word embeddings, compare it to multiple baselines and show that it performs at or above state-of-the-art level in zero-shot transfer settings. Finally, we show that our method allows for better knowledge sharing across languages in a joint training setting.
2020
Distributed word embeddings have become ubiquitous in natural language processing as they have been shown to improve performance in many semantic and syntactic tasks. Popular models for learning cross-lingual word embeddings do not consider the morphology of words. We propose an approach to learn bilingual embeddings using parallel data and subword information that is expressed in various forms, i.e. character n-grams, morphemes obtained by unsupervised morphological segmentation, and byte pair encoding. We report results for three low-resource morphologically rich languages (Swahili, Tagalog, and Somali) and a high-resource language (German) in a simulated low-resource scenario. Our results show that the method leveraging subword information outperforms the model without subword information, both in intrinsic and extrinsic evaluations of the learned embeddings. Specifically, analogy reasoning results show that using subwords helps capture syntactic characteristics. Semanticall...
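A hedged sketch of the subword idea above, using gensim's FastText character n-grams as a stand-in for the paper's own segmentation variants (morphemes, byte pair encoding); the corpus file and hyper-parameter values are placeholders.

```python
# Sketch of capturing subword information with character n-grams. gensim's
# FastText is used as an illustrative stand-in; the corpus file and settings
# are placeholders, not the paper's pipeline.
from gensim.models import FastText

sentences = [line.split() for line in open("swahili.txt", encoding="utf-8")]

model = FastText(
    sentences,
    vector_size=300,
    min_n=3,          # shortest character n-gram
    max_n=6,          # longest character n-gram
    min_count=3,
)

# Because vectors are built from n-grams, even unseen (out-of-vocabulary)
# morphological variants receive a representation.
vec = model.wv["watoto"]   # works even if "watoto" never occurred verbatim
```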
2020
This paper presents the first ever comprehensive evaluation of different types of word embeddings for the Sinhala language. Three standard word embedding models, namely Word2Vec (both Skip-gram and CBOW), FastText, and GloVe, are evaluated under two types of evaluation methods: intrinsic evaluation and extrinsic evaluation. Word analogy and word relatedness evaluations were performed in terms of intrinsic evaluation, while sentiment analysis and part-of-speech (POS) tagging were conducted as the extrinsic evaluation tasks. Benchmark datasets used for intrinsic evaluations were carefully crafted considering specific linguistic features of Sinhala. In general, FastText word embeddings with 300 dimensions reported the best accuracies across all the evaluation tasks, while GloVe reported the lowest results.
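The two intrinsic evaluations mentioned (word analogy and word relatedness) can be run with gensim's built-in helpers, as sketched below; the model path and benchmark file names are hypothetical stand-ins for the Sinhala resources described in the abstract.

```python
# Sketch of intrinsic evaluation of word vectors with gensim; the model file
# and the two benchmark files are illustrative placeholders.
from gensim.models import KeyedVectors

wv = KeyedVectors.load("sinhala_fasttext_300d.kv")

# Word analogy: accuracy on "a : b :: c : d" questions in the standard
# questions-words format.
analogy_score, _ = wv.evaluate_word_analogies("sinhala_analogies.txt")

# Word relatedness: correlation between human similarity ratings and the
# cosine similarity of the vectors (Pearson and Spearman are returned).
pearson, spearman, oov_ratio = wv.evaluate_word_pairs("sinhala_wordsim.tsv")

print(analogy_score, spearman)
```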
arXiv: Computation and Language, 2021
Cross-Lingual Word Embeddings (CLWEs) encode words from two or more languages in a shared high-dimensional space in which vectors representing words with similar meaning (regardless of language) are closely located. Existing methods for building high-quality CLWEs learn mappings that minimise the ℓ2 norm loss function. However, this optimisation objective has been demonstrated to be sensitive to outliers. Based on the more robust Manhattan norm (a.k.a. ℓ1 norm) goodness-of-fit criterion, this paper proposes a simple post-processing step to improve CLWEs. An advantage of this approach is that it is fully agnostic to the training process of the original CLWEs and can therefore be applied widely. Extensive experiments are performed involving ten diverse languages and embeddings trained on different corpora. Evaluation results based on bilingual lexicon induction and cross-lingual transfer for natural language inference tasks show that the ℓ1 refinement substantially outperforms four state-of-the-art baselines in both supervised and unsupervised settings. It is therefore recommended that this strategy be adopted as a standard for CLWE methods.
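The contrast between the two criteria can be made concrete with a small sketch: solve the standard orthogonal Procrustes mapping and compare the ℓ2 and ℓ1 residuals of the fit. This only illustrates the two loss functions on synthetic data; it is not the paper's refinement procedure.

```python
# Sketch contrasting the squared (l2) loss used by standard Procrustes mapping
# with the more outlier-robust absolute (l1) loss discussed above. X and Y
# stand for paired source/target vectors from a seed dictionary; the data
# here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
Y = rng.normal(size=(1000, 300))

# Standard l2 solution: orthogonal Procrustes via SVD.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

residuals = X @ W - Y
l2_loss = np.sum(residuals ** 2)        # sensitive to outlying word pairs
l1_loss = np.sum(np.abs(residuals))     # Manhattan criterion, more robust

print(l2_loss, l1_loss)
```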
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
Much work in Natural Language Processing (NLP) has been for resource-rich languages, making generalization to new, less-resourced languages challenging. We present two approaches for improving generalization to low-resourced languages by adapting continuous word representations using linguistically motivated subword units: phonemes, morphemes and graphemes. Our method requires neither parallel corpora nor bilingual dictionaries and provides a significant gain in performance over previous methods relying on these resources. We demonstrate the effectiveness of our approaches on Named Entity Recognition for four languages, namely Uyghur, Turkish, Bengali and Hindi, of which Uyghur and Bengali are low-resource languages, and also perform experiments on Machine Translation. Exploiting subwords with transfer learning gives us a boost of +15.2 NER F1 for Uyghur and +9.7 F1 for Bengali. We also show improvements in the monolingual setting where we achieve (avg.) +3 F1 and (avg.) +1.35 BLEU.
International Journal on Web Service Computing (IJWSC), 2019
Recent advances in generating monolingual word embeddings based on word co-occurrence for universal languages inspired new efforts to extend the model to support diversified languages. State-of-the-art methods for learning cross-lingual word embeddings rely on the alignment of monolingual word embedding spaces. Our goal is to implement word co-occurrence across languages via the universal concepts method. Such concepts are notions that are fundamental to humankind and are thus persistent across languages, e.g., man or woman, war or peace, etc. Given bilingual lexicons, we built universal concepts as undirected graphs of connected nodes and then replaced the words belonging to the same graph with a unique graph ID. This intuitive design makes use of universal concepts in monolingual corpora, which helps generate meaningful word embeddings across languages via the word co-occurrence concept. Standardized benchmarks demonstrate how this underutilized approach competes with SOTA...
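A minimal sketch of the universal-concepts construction described above, assuming networkx: lexicon entries become edges of an undirected graph, each connected component is treated as one concept, and corpus tokens are rewritten to their concept ID before a co-occurrence model is trained. The lexicon entries and the example sentence are illustrative placeholders.

```python
# Sketch: build "universal concepts" as connected components of a bilingual
# lexicon graph and replace corpus tokens with their concept ID. Data shown
# is illustrative only.
import networkx as nx

# Bilingual lexicon entries (source_word, target_word) - placeholders.
lexicon = [("man", "monna"), ("woman", "mosadi"), ("peace", "kagiso")]

g = nx.Graph()
g.add_edges_from(lexicon)

# Map every word in a connected component to a shared concept ID.
word_to_concept = {}
for cid, component in enumerate(nx.connected_components(g)):
    for word in component:
        word_to_concept[word] = f"CONCEPT_{cid}"

def replace_with_concepts(tokens):
    """Replace lexicon words with their concept ID; leave other tokens as-is."""
    return [word_to_concept.get(t, t) for t in tokens]

print(replace_with_concepts(["the", "man", "wants", "peace"]))
```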