How we measure answer quality, track retrieval performance, and improve over time.
- Precision: Of the top results we showed, how many were truly relevant? Higher = fewer off-topic sources.
- Recall: How much of the right information did we find? Higher = more complete answers.
- Ranking quality: Checks whether the best source appears near the top. Higher = less scrolling to the good stuff.
- Temporal fit: Did we find enough recent content when the question asked for "recent" or a time period?
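For illustration, here is a minimal sketch of how these four metrics might be computed for a single query. The relevance judgments (`relevant_ids`), the `published_at` field, and the freshness cutoff are assumptions for the example, not our production schema.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Of the top-k results shown, what fraction were truly relevant?"""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def recall(retrieved_ids, relevant_ids):
    """What fraction of all known-relevant sources did we find at all?"""
    if not relevant_ids:
        return 0.0
    return sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids) / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result; 1.0 means it was on top."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def temporal_fit(retrieved_docs, cutoff):
    """Fraction of retrieved docs published inside the question's time window."""
    if not retrieved_docs:
        return 0.0
    return sum(1 for doc in retrieved_docs if doc["published_at"] >= cutoff) / len(retrieved_docs)

# Example with hypothetical doc ids; the "gold" relevant set comes from
# the curated test questions described later on this page.
hits = ["a", "c", "b"]
gold = {"a", "b"}
assert precision_at_k(hits, gold, 2) == 0.5
assert recall(hits, gold) == 1.0
assert reciprocal_rank(hits, gold) == 1.0
```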
We also run an automated reviewer ("LLM judge") in batches. It reads only the same sources we used and scores answers for hallucination, completeness, relevance, and temporal fit. Results are stored for trend tracking.
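A sketch of what that batch judge loop can look like. The `call_llm` helper, the rubric wording, and the list-based `store` are placeholders for this example; the real prompt, model client, and storage differ.

```python
import json

# The judge sees only the sources the answer used, never outside knowledge.
RUBRIC = """You are a strict reviewer. Using ONLY the sources below, score the answer
on each dimension from 1 (worst) to 5 (best) and reply with JSON only:
{{"hallucination": 1-5, "completeness": 1-5, "relevance": 1-5, "temporal_fit": 1-5}}

Sources:
{sources}

Question: {question}
Answer: {answer}"""

def judge_batch(examples, call_llm, store):
    """Score each (question, sources, answer) triple and keep the scores for trends."""
    for ex in examples:
        prompt = RUBRIC.format(
            sources="\n---\n".join(ex["sources"]),
            question=ex["question"],
            answer=ex["answer"],
        )
        scores = json.loads(call_llm(prompt))  # judge replies with JSON only
        store.append({"id": ex["id"], **scores})
```

Here `store` can be as simple as a Python list flushed to a database afterward; the point is that every run leaves a scored record behind for trend tracking.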
We avoid brand-by-brand score dumps on this page; stakeholders get clear goals and progress instead.
We keep curated test questions per brand. Results (precision, recall, ranking quality, and LLM-judge quality scores) are compared across runs to validate improvements. Editorial “gold” examples help us lock in high-quality answers.
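One way to wire those test sets into a run-over-run comparison, assuming each evaluation run is saved as a JSON file of per-question metric rows; the file layout and function names here are illustrative, not our actual harness.

```python
import json
from pathlib import Path
from statistics import mean

def save_run(results, runs_dir, run_id):
    """Persist one evaluation run: a list of per-question metric dicts."""
    path = Path(runs_dir) / f"{run_id}.json"
    path.write_text(json.dumps(results, indent=2))

def compare_to_baseline(runs_dir, baseline_id, candidate_id, metric="precision"):
    """Did the candidate run improve the average metric over the baseline?"""
    def load(run_id):
        rows = json.loads((Path(runs_dir) / f"{run_id}.json").read_text())
        return mean(row[metric] for row in rows)

    base, cand = load(baseline_id), load(candidate_id)
    return {"baseline": base, "candidate": cand, "delta": cand - base}
```

Keeping runs as dated artifacts like this is what makes the over-time comparison cheap: any two runs on the same question set can be diffed metric by metric.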