How we measure answer quality, track retrieval performance, and improve over time.
- Precision: Of the top results we showed, how many were truly relevant? Higher = fewer off-topic sources.
- Recall: How much of the right information did we find? Higher = more complete answers.
- Ranking quality: Checks whether the best source appears near the top. Higher = less scrolling to the good stuff.
- Temporal fit: Did we find enough recent content when the question asked for "recent" or a time period?
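For illustration, here is a minimal sketch of how these four metrics might be computed for a single query. The relevance judgments (`relevant_ids`), the `published_at` field, and the freshness cutoff are assumptions for the example, not our production schema.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Of the top-k results shown, what fraction were truly relevant?"""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

def recall(retrieved_ids, relevant_ids):
    """What fraction of all known-relevant sources did we find at all?"""
    if not relevant_ids:
        return 0.0
    return sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids) / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result; 1.0 means it was on top."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def temporal_fit(retrieved_docs, cutoff):
    """Fraction of retrieved docs published inside the question's time window."""
    if not retrieved_docs:
        return 0.0
    return sum(1 for doc in retrieved_docs if doc["published_at"] >= cutoff) / len(retrieved_docs)

# Example with hypothetical doc ids; the "gold" relevant set comes from
# the curated test questions described later on this page.
hits = ["a", "c", "b"]
gold = {"a", "b"}
assert precision_at_k(hits, gold, 2) == 0.5
assert recall(hits, gold) == 1.0
assert reciprocal_rank(hits, gold) == 1.0
```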
We also run an automated reviewer ("LLM judge") in batches. It reads only the same sources we used and scores answers for hallucination, completeness, relevance, and temporal fit. Results are stored for trend tracking.
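A sketch of what that batch judge loop can look like. The `call_llm` helper, the rubric wording, and the list-based `store` are placeholders for this example; the real prompt, model client, and storage differ.

```python
import json

# The judge sees only the sources the answer used, never outside knowledge.
RUBRIC = """You are a strict reviewer. Using ONLY the sources below, score the answer
on each dimension from 1 (worst) to 5 (best) and reply with JSON only:
{{"hallucination": 1-5, "completeness": 1-5, "relevance": 1-5, "temporal_fit": 1-5}}

Sources:
{sources}

Question: {question}
Answer: {answer}"""

def judge_batch(examples, call_llm, store):
    """Score each (question, sources, answer) triple and keep the scores for trends."""
    for ex in examples:
        prompt = RUBRIC.format(
            sources="\n---\n".join(ex["sources"]),
            question=ex["question"],
            answer=ex["answer"],
        )
        scores = json.loads(call_llm(prompt))  # judge replies with JSON only
        store.append({"id": ex["id"], **scores})
```

Here `store` can be as simple as a Python list flushed to a database afterward; the point is that every run leaves a scored record behind for trend tracking.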
We avoid brand-by-brand score dumps on this page; stakeholders get clear goals and progress instead.
We keep curated test questions per brand. Results (precision, recall, ranking quality, and LLM-judge quality scores) are compared across runs to validate improvements. Editorial “gold” examples help us lock in high-quality answers.
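One way to wire those test sets into a run-over-run comparison, assuming each evaluation run is saved as a JSON file of per-question metric rows; the file layout and function names here are illustrative, not our actual harness.

```python
import json
from pathlib import Path
from statistics import mean

def save_run(results, runs_dir, run_id):
    """Persist one evaluation run: a list of per-question metric dicts."""
    path = Path(runs_dir) / f"{run_id}.json"
    path.write_text(json.dumps(results, indent=2))

def compare_to_baseline(runs_dir, baseline_id, candidate_id, metric="precision"):
    """Did the candidate run improve the average metric over the baseline?"""
    def load(run_id):
        rows = json.loads((Path(runs_dir) / f"{run_id}.json").read_text())
        return mean(row[metric] for row in rows)

    base, cand = load(baseline_id), load(candidate_id)
    return {"baseline": base, "candidate": cand, "delta": cand - base}
```

Keeping runs as dated artifacts like this is what makes the over-time comparison cheap: any two runs on the same question set can be diffed metric by metric.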