Understanding the main concepts and tools used in NLP
When processing documents, the first analytical step is usually to infer the document language. Most analytical engines used in NLP tasks are trained on documents written in a specific language and should only be applied to that language. Attempts to build cross-language models (see, for instance, multilingual embeddings such as https://fasttext.cc/docs/en/aligned-vectors.html and https://github.com/google-research/bert/blob/master/multilingual.md) have recently gained popularity, but these models face challenges, such as lower performance compared to language-specific models and difficulty in handling languages with sparse training data or with significantly different syntax and grammatical structures. For these reasons, they still represent a small portion of NLP models. It is, therefore, very common to infer the language first, so that the correct downstream NLP pipeline can be selected.
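The idea of detecting the language first and then routing the document to a language-specific pipeline can be sketched as follows. This is only a toy illustration: the stop-word lists, the `guess_language` heuristic, and the placeholder pipelines are assumptions made for the example, not the method of any particular library, and real language detectors are considerably more robust.

```python
import re

# Tiny, hand-picked stop-word lists (illustrative assumption, far from complete).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "it": {"il", "la", "e", "di", "che", "è", "un", "per"},
    "fr": {"le", "la", "et", "est", "de", "que", "un", "les"},
}


def guess_language(text):
    """Return the language whose stop words occur most often in the text, or None."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # If no stop word matched at all, refuse to guess.
    return best if scores[best] > 0 else None


def make_pipeline(lang):
    """Build a placeholder pipeline; a real one would load a language-specific model."""
    def pipeline(text):
        return lang, re.findall(r"\w+", text.lower())
    return pipeline


# Hypothetical registry mapping each language code to its downstream pipeline.
PIPELINES = {lang: make_pipeline(lang) for lang in STOPWORDS}


def process(text):
    """Infer the language, then dispatch the text to the matching pipeline."""
    lang = guess_language(text)
    if lang not in PIPELINES:
        raise ValueError("No pipeline available for the detected language")
    return PIPELINES[lang](text)
```

Calling `process` on a sentence returns the detected language code together with the tokens produced by the corresponding placeholder pipeline. Text for which no stop word matches raises a `ValueError` rather than being sent to the wrong pipeline.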
In order to infer the...