How Scoring Works in Quepid


To measure how good your search quality is, we need an Evaluation Measure, which in Quepid parlance is a Scorer. Today Quepid supports only a single scorer per evaluation, in contrast to tools like RRE, which can report multiple metrics.

NDCG

The default scorer is NDCG@10 with a rating scale of 1 to 4. The way to think about these ratings when you are using them is:

  1. A One is Poor, and it's a document that makes you actively upset with your search engine! It's a BAD result.
  2. A Two is Fair, and it represents an irrelevant document. You understand why it matched, but it's not relevant.
  3. A Three is Good, and it is a relevant document. It makes sense why the search engine returned it.
  4. A Four is Perfect, and this is a perfect match. There isn't any ambiguity about why that document matched.

It's okay for there to NOT be any Fours, especially for a query that is relatively exploratory in nature. For example, if you searched for "best movie", it would be hard for the engine to return documents that would be rated a 4 unless it knows your favorite movie. However, if you type in "Star Wars" and get back "Star Wars: A New Hope", that looks like a Four. But returning "Star Wars: The Phantom Menace" is probably a Three: it's relevant, but it wasn't exactly what you wanted!

Ideal Situation

When scoring your documents, you may see that it's possible to score a perfect 100 even though all of your documents are rated with the lowest score.

(Screenshot: a query scoring 100 with all documents rated 1)

One weird thing about NDCG is that it compares against the ideal ordering of the rated documents, i.e., the orderings 4,3,2,1 and 3,3,1 and 1,1,1 all score the same. This means that, no matter the ratings, if the final results match the ideal ordering, the score will be 100. So, if you score ten documents as 1,4,1,1,1,1,1,1,1,1, you get a 72. Tweak your algorithm to move that 4 to be first, so the results are ordered 4,1,1,1,1,1,1,1,1,1, and boom, you are back to 100 (even though 90% of the documents were rated as irrelevant)!
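To make the arithmetic concrete, here is a minimal Python sketch of NDCG@10. It assumes the common exponential gain (2^rating - 1) with a log2 position discount, which reproduces the 72 and 100 figures above; the exact formula inside Quepid's scorer may differ in detail.

```python
import math

def dcg(ratings):
    # Discounted cumulative gain: exponential gain (2^rating - 1), log2 position discount.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ratings))

def ndcg_at_10(ratings):
    # NDCG@10 over the returned documents only (ratings are 1-4, so the ideal DCG is never zero).
    top = ratings[:10]
    ideal = sorted(top, reverse=True)  # best possible ordering of those same ratings
    return 100 * dcg(top) / dcg(ideal)

print(round(ndcg_at_10([1, 4, 1, 1, 1, 1, 1, 1, 1, 1])))  # 72
print(round(ndcg_at_10([4, 1, 1, 1, 1, 1, 1, 1, 1, 1])))  # 100
print(round(ndcg_at_10([1] * 10)))                        # 100 -- all 1's is already the ideal order
```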

Local vs Global Scoring

Another issue is that the default NDCG@10 scorer only looks at the documents that are returned by the search engine. This is sometimes called the "NDCG Local" variant. If you have rated other highly relevant documents, for example by using the Explain Other feature to find and rate them, you would expect those ratings to contribute to the score. Imagine your search engine returns a single document rated a 1: you currently get 100 as the score.

(Screenshot: the Explain Other feature)

If you then used Explain Other to find some other docs that you rate as 1, 4, and 4, you would expect NDCG to treat the rated set as 1,1,4,4 and give you a very low score for not ranking the 4's higher! That would be the "NDCG Global" variant, which is currently not supported. This is tracked in https://github.com/o19s/quepid/issues/78.
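To make the Local vs Global distinction concrete, here is a small Python sketch using the same NDCG math as above. The ndcg_global function is hypothetical; as noted, Quepid does not implement that variant today.

```python
import math

def dcg(ratings):
    # Discounted cumulative gain: exponential gain, log2 position discount.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ratings))

def ndcg_local(returned):
    # "Local" variant: the ideal list is built only from the returned documents.
    ideal = sorted(returned, reverse=True)
    return 100 * dcg(returned[:10]) / dcg(ideal[:10])

def ndcg_global(returned, all_rated):
    # Hypothetical "global" variant: the ideal list is built from every rated
    # document for the query, including ones the engine never returned.
    ideal = sorted(all_rated, reverse=True)
    return 100 * dcg(returned[:10]) / dcg(ideal[:10])

returned = [1]            # the engine returned one document, rated 1
all_rated = [1, 1, 4, 4]  # that document plus three more rated via Explain Other

print(round(ndcg_local(returned)))              # 100 -- the single returned doc is "ideally" placed
print(round(ndcg_global(returned, all_rated)))  # ~4  -- the missing 4's drag the score way down
```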