Beyond accuracy

A top-N recommender that nails precision@k and NDCG@k can still be a bad product: it recommends the same blockbusters to everyone, never surfaces the long tail, and shows the user nothing they couldn't have found themselves. recommender_systems.metrics ships four beyond-accuracy metrics so the benchmark catches that.

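All four take the same per-user input as the ranking metrics: parallel sequences of predicted ranked lists, plus held-out relevant items for serendipity. Each returns a single float. Intra-list diversity, novelty, and serendipity are macro-averaged across users; catalog coverage is computed once over the union of all users' lists.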

Intra-list diversity

How different the items inside one user's top-N are from each other. For each user, the metric averages 1 - similarity(item_a, item_b) over the distinct pairs in the top-k, then macro-averages across users. With a similarity function bounded in [0, 1], a list of ten near-identical items has diversity near zero; a list of ten unrelated items has diversity near one.

from recommender_systems.metrics import intra_list_diversity

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(1, len(sa | sb))

predicted = [["sci-fi space", "sci-fi space alien", "regency romance"]]
ild = intra_list_diversity(predicted, jaccard, k=3)
# Two sci-fi items overlap heavily; the romance pulls diversity up.
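To make the arithmetic behind that example concrete, here is a minimal sketch of the computation in plain Python. It is not the library's implementation: `ild_sketch` is a hypothetical name, and it assumes unordered distinct pairs and that users with fewer than two items are skipped.

```python
from itertools import combinations
from typing import Callable, Sequence


def ild_sketch(
    predicted: Sequence[Sequence[str]],
    similarity: Callable[[str, str], float],
    k: int,
) -> float:
    """Mean pairwise (1 - similarity) per user, macro-averaged across users."""
    per_user = []
    for ranked in predicted:
        pairs = list(combinations(ranked[:k], 2))
        if not pairs:  # assumption: a list with fewer than two items is skipped
            continue
        distances = [1.0 - similarity(a, b) for a, b in pairs]
        per_user.append(sum(distances) / len(distances))
    return sum(per_user) / len(per_user) if per_user else 0.0


def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(1, len(sa | sb))


predicted = [["sci-fi space", "sci-fi space alien", "regency romance"]]
ild_value = ild_sketch(predicted, jaccard, k=3)
# Pair distances: 1/3 for the two sci-fi items, 1 and 1 for each against the romance.
# Mean: (1/3 + 1 + 1) / 3 = 7/9, roughly 0.78.
```

If the library averages differently (for example over ordered pairs, or with a different rule for short lists), its result on other inputs may differ from this sketch.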

Novelty

Mean self-information of the recommended items: how much the recommender surfaces things a user couldn't have stumbled on by themselves. Items that everyone interacts with carry little novelty; long-tail items carry more.

from recommender_systems.metrics import novelty

# Popularity is the fraction of users who interacted with each item.
item_popularity = {"a": 0.5, "b": 0.25, "c": 0.05}
predicted = [["a", "b", "c"]]
n = novelty(predicted, item_popularity, k=3)
# c (rarely seen) contributes much more than a (seen by half of all users).
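A rough sketch of what "mean self-information" works out to on this example, under stated assumptions: `novelty_sketch` is a hypothetical name, the logarithm is taken base 2 (so the unit is bits), and every recommended item is assumed to have a popularity greater than zero. The library may use a different log base or handle missing items differently.

```python
import math
from typing import Mapping, Sequence


def novelty_sketch(
    predicted: Sequence[Sequence[str]],
    item_popularity: Mapping[str, float],
    k: int,
) -> float:
    """Mean of -log2(popularity) over each user's top-k, macro-averaged."""
    per_user = []
    for ranked in predicted:
        top_k = ranked[:k]
        if not top_k:  # assumption: users with empty lists are skipped
            continue
        # Assumption: every item is present in item_popularity with a value > 0.
        info = [-math.log2(item_popularity[item]) for item in top_k]
        per_user.append(sum(info) / len(info))
    return sum(per_user) / len(per_user) if per_user else 0.0


item_popularity = {"a": 0.5, "b": 0.25, "c": 0.05}
novelty_value = novelty_sketch([["a", "b", "c"]], item_popularity, k=3)
# Self-information in bits: a = 1.0, b = 2.0, c = log2(20), about 4.32.
# Mean: about 2.44 bits.
```

The long-tail item c alone accounts for more than half of the total, which is the behaviour the metric is designed to reward.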

Catalog coverage

The simplest of the four: the fraction of the catalog that appears in at least one user's top-k. A recommender that fixates on a handful of popular items has coverage near zero; one that exposes the long tail has high coverage.

from recommender_systems.metrics import catalog_coverage

catalog = {"a", "b", "c", "d", "e"}
predicted = [["a", "b"], ["a", "c"], ["a", "d"]]
coverage = catalog_coverage(predicted, catalog, k=2)
# a, b, c, d appear in some user's top-2 — coverage 4/5 = 0.8.
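The computation in that example can be sketched in a few lines. `coverage_sketch` is a hypothetical name, not the library's function, and the handling of an empty catalog and of recommended items outside the catalog are assumptions.

```python
from typing import Sequence, Set


def coverage_sketch(
    predicted: Sequence[Sequence[str]],
    catalog: Set[str],
    k: int,
) -> float:
    """Fraction of catalog items that appear in at least one user's top-k."""
    if not catalog:  # assumption: an empty catalog yields 0.0
        return 0.0
    seen: Set[str] = set()
    for ranked in predicted:
        seen.update(ranked[:k])
    # Assumption: recommended items outside the catalog are ignored.
    return len(seen & catalog) / len(catalog)


catalog = {"a", "b", "c", "d", "e"}
varied = coverage_sketch([["a", "b"], ["a", "c"], ["a", "d"]], catalog, k=2)
same_for_all = coverage_sketch([["a", "b"], ["a", "b"], ["a", "b"]], catalog, k=2)
# varied reaches 4 of 5 items; same_for_all reaches only 2 of 5.
```

Note that the score is taken over the union of all lists, so adding users can only raise coverage or leave it unchanged.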

The MovieLens and goodbooks benchmarks both report coverage@10. The most striking observation: MeanRating with min_ratings=5 covers about 1.6% of the catalog; ItemKNN covers ~29%. Accuracy parity doesn't say anything about reach.

Serendipity

How often the top-N contains items a user finds both relevant and unexpected — items they wouldn't have gotten from a popularity baseline. Captures the difference between "accurate and obvious" and "accurate and surprising".

from recommender_systems.baselines import MostPopular
from recommender_systems.metrics import serendipity_at_k

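# Assumes `train`, `users`, `model`, and `truth` (user -> set of held-out
# relevant items) already exist from the evaluation setup.
# Per user: the top-N a trivial baseline would have given them.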
baseline = MostPopular().fit(train)
expected = [set(baseline.recommend(u, n=10)) for u in users]

predicted = [model.recommend(u, n=10) for u in users]
actual = [truth.get(u, set()) for u in users]
ser = serendipity_at_k(predicted, actual, expected, k=10)

A MostPopular recommender has zero serendipity by construction — its recommendations are the baseline.
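One common way to define this, sketched here as an assumption rather than as the library's exact formula: count the top-k items that are relevant and not in the baseline's list, divide by k, and macro-average across users. `serendipity_sketch` and the toy data are hypothetical.

```python
from typing import Sequence, Set


def serendipity_sketch(
    predicted: Sequence[Sequence[str]],
    actual: Sequence[Set[str]],
    expected: Sequence[Set[str]],
    k: int,
) -> float:
    """Share of top-k items that are relevant and not expected, macro-averaged."""
    per_user = []
    for ranked, relevant, obvious in zip(predicted, actual, expected):
        top_k = ranked[:k]
        hits = sum(1 for item in top_k if item in relevant and item not in obvious)
        per_user.append(hits / k)  # assumption: normalise by k, not by list length
    return sum(per_user) / len(per_user) if per_user else 0.0


expected = [{"a", "b", "c"}, {"a", "b", "c"}]  # what the baseline would show
actual = [{"a", "x"}, {"z"}]                   # held-out relevant items
predicted = [["a", "x", "y"], ["a", "b", "z"]]

model_score = serendipity_sketch(predicted, actual, expected, k=3)
# User 1: only x is relevant and unexpected. User 2: only z. Each scores 1/3.

baseline_lists = [["a", "b", "c"], ["a", "b", "c"]]
baseline_score = serendipity_sketch(baseline_lists, actual, expected, k=3)
# Every recommended item is in the expected set, so the score is 0.
```

The second call illustrates the zero-by-construction case: when the predictions are the baseline's own lists, no item can be both recommended and unexpected.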

Two recommenders with the same precision@10 are not the same

The MovieLens benchmark shows UserKNN and ItemKNN essentially tied on accuracy (precision@10 of 0.3199 vs 0.3240) but with very different catalog coverage (0.2170 vs 0.2919). The accuracy metrics say both retrieve correct items; coverage tells you ItemKNN does so across a broader slice of the catalog. Whether one is better depends on the product question, not the precision score.
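A toy illustration of the same effect, using made-up data rather than benchmark output. The helper functions are hypothetical stand-ins written for this sketch, not the library's metrics.

```python
from typing import Sequence, Set


def precision_at_k(
    predicted: Sequence[Sequence[str]], actual: Sequence[Set[str]], k: int
) -> float:
    """Macro-averaged fraction of each user's top-k that is relevant."""
    per_user = [len(set(r[:k]) & rel) / k for r, rel in zip(predicted, actual)]
    return sum(per_user) / len(per_user)


def coverage_at_k(predicted: Sequence[Sequence[str]], catalog: Set[str], k: int) -> float:
    """Fraction of the catalog that appears in any user's top-k."""
    seen = {item for ranked in predicted for item in ranked[:k]}
    return len(seen & catalog) / len(catalog)


catalog = {"a", "b", "c", "d", "e", "f"}
actual = [{"a", "c"}, {"a", "d"}, {"a", "e"}]

# Toy recommender 1: the same popular pair for every user.
narrow = [["a", "b"], ["a", "b"], ["a", "b"]]
# Toy recommender 2: one personalised hit per user, plus a miss.
broad = [["c", "b"], ["d", "f"], ["e", "b"]]

narrow_precision = precision_at_k(narrow, actual, k=2)
broad_precision = precision_at_k(broad, actual, k=2)
narrow_coverage = coverage_at_k(narrow, catalog, k=2)
broad_coverage = coverage_at_k(broad, catalog, k=2)
# Both recommenders get one hit out of two for every user,
# but the first exposes 2 of 6 catalog items and the second 5 of 6.
```

On precision alone the two are indistinguishable; only the coverage figure shows that one of them never leaves the popular head of the catalog.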