Providing a quick overview of a dataset
To show how to process a corpus of documents and extract relevant information from it, we will use a dataset derived from a well-known NLP benchmark: the Reuters-21578 dataset. The original dataset comprises 21,578 news articles published on the Reuters financial newswire in 1987, which were assembled and indexed by category. The original dataset has a very skewed distribution, with some categories appearing only in the training set or only in the test set. For this reason, we use a modified version named ApteMod, derived from Reuters-21578 Distribution 1.0, which has a less skewed distribution and consistent labels between the training and test sets.
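The skew and the label consistency between splits can be checked directly. The following is a minimal sketch, assuming the corpus is read through NLTK's `reuters` corpus reader (which ships the ApteMod version) and that file IDs carry a `training/` or `test/` prefix. The helper names `category_counts_by_split`, `categories_missing_from_a_split`, and `load_reuters_categories` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def split_of(file_id):
    """Return the split name encoded in a file ID such as 'training/9865'."""
    return file_id.split("/", 1)[0]


def category_counts_by_split(doc_categories):
    """Count how many documents carry each category, per split.

    doc_categories maps a file ID to the list of its category labels.
    """
    counts = {"training": Counter(), "test": Counter()}
    for file_id, categories in doc_categories.items():
        counts.setdefault(split_of(file_id), Counter()).update(categories)
    return counts


def categories_missing_from_a_split(counts):
    """Return the categories that appear in only one of the two splits."""
    return set(counts["training"]) ^ set(counts["test"])


def load_reuters_categories():
    """Build the file ID -> categories mapping from NLTK.

    Assumption: nltk is installed and the corpus has been fetched
    beforehand with nltk.download('reuters').
    """
    from nltk.corpus import reuters

    return {fid: reuters.categories(fid) for fid in reuters.fileids()}
```

Calling `category_counts_by_split(load_reuters_categories())` and inspecting `counts["training"].most_common(5)` gives a quick view of how unevenly the categories are populated, while `categories_missing_from_a_split(counts)` lists any label that is absent from one of the two splits.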
Although the news articles from the Reuters financial newswire are somewhat dated, the dataset has been used in a large number of NLP papers and is still widely used for benchmarking algorithms. Nevertheless...