Evaluating the output
In information retrieval, we are interested in measuring whether each retrieved document is relevant or irrelevant. Therefore, the most commonly used evaluation metrics are precision and recall. Precision is the fraction of retrieved documents that are relevant, while recall is the fraction of relevant documents that are successfully retrieved. Consider a query against a corpus of documents D, where R represents the set of all relevant documents and NR represents the irrelevant ones. Rq is the set of relevant documents that are found, and Dq is the set of documents returned by the system. We can define the two metrics as follows:

$$\mathrm{Precision} = \frac{|R_q|}{|D_q|}, \qquad \mathrm{Recall} = \frac{|R_q|}{|R|}$$
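A minimal sketch of these two definitions for a single query, using hypothetical document IDs (the sets `retrieved` and `relevant` stand in for Dq and R):

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    # R_q: documents that are both retrieved and relevant
    hits = [doc for doc in retrieved if doc in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d4", "d7", "d9"]  # D_q: documents returned by the system
relevant = {"d1", "d2", "d7"}         # R: all relevant documents in the corpus
print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
```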
The problem with these two metrics is that they do not capture the quality of the ranking; they only tell us whether we are finding all the relevant documents (recall) or what fraction of the returned documents is relevant (precision). Usually, when we use a retriever, we select a number (k) of documents that we use for context (top-k), so we need a metric that takes ranking into account, as the sketch below illustrates...
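To see why a set-based metric is not enough, consider two hypothetical retrievers that return the same top-3 documents in different orders. Precision computed over the top k ignores position, so both runs score identically, even though placing the relevant document first is clearly better when the top results feed an LLM's context:

```python
relevant = {"d1"}
run_a = ["d1", "d5", "d8"]  # relevant document ranked first
run_b = ["d8", "d5", "d1"]  # relevant document ranked last

k = 3
for run in (run_a, run_b):
    hits = sum(doc in relevant for doc in run[:k])
    print(f"precision@{k} = {hits / k:.2f}")  # 0.33 for both runs
```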