When we ask an AI system a question, it can sometimes give an answer that sounds convincing but isn’t backed up by the right information. In everyday situations, this might not matter much, but in health care—where accuracy and trust are essential—the stakes are much higher.
One promising approach to reduce this risk is called Retrieval-Augmented Generation (RAG). Instead of relying only on what it has memorized during training, a RAG system first searches trusted sources—like medical guidelines or scientific articles—and then uses that information to generate its response. In other words, it doesn’t just invent an answer; it looks things up before speaking.
This design makes RAG especially valuable in health care, where knowledge is constantly evolving and every answer must be both reliable and safe. But it also highlights why careful evaluation is necessary: we need to ensure the system retrieves the right sources, interprets them correctly, and communicates its conclusions in a way health professionals can trust. Evaluating RAG in this context is about much more than performance—it’s about ensuring technology truly supports better decisions and, ultimately, better care.
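For readers who prefer code to prose, here is a minimal sketch of the retrieve-then-generate loop in Python. Everything in it is illustrative: the two-document corpus, the word-overlap retriever, and the template generator are stand-ins, not Clinia's implementation; a real system would query a vetted medical index and prompt an LLM constrained to the retrieved evidence.

```python
import re
from collections import Counter
from dataclasses import dataclass

@dataclass
class Document:
    source: str  # e.g. a clinical guideline or a journal article
    text: str

# Hypothetical two-document corpus standing in for a vetted medical index.
CORPUS = [
    Document("sleep-medicine-guideline",
             "Hypoglossal nerve stimulation is an implanted therapy "
             "for obstructive sleep apnea."),
    Document("neuroanatomy-reference",
             "The amygdala is a cluster of nuclei in the temporal lobe."),
]

def _tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, k: int = 1) -> list[Document]:
    """Toy retriever: rank documents by word overlap with the query.
    A production system would use a real search or embedding index."""
    q = _tokens(query)
    ranked = sorted(CORPUS,
                    key=lambda d: sum((q & _tokens(d.text)).values()),
                    reverse=True)
    return ranked[:k]

def generate(query: str, evidence: list[Document]) -> str:
    """Toy generator: a real system would prompt an LLM, constraining it
    to the retrieved evidence and citing the sources it used."""
    return " ".join(f"{d.text} [{d.source}]" for d in evidence)

def rag_answer(query: str) -> str:
    evidence = retrieve(query)        # 1. look things up first...
    return generate(query, evidence)  # 2. ...then answer, grounded in sources

print(rag_answer("What is hypoglossal nerve stimulation?"))
```

The point of the structure is auditability: because the answer is assembled from retrieved passages, the system can always show which sources it relied on.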
Traditional automatic evaluation metrics such as BLEU and ROUGE, originally designed for machine translation and summarization, fall short when applied to retrieval-augmented generation (RAG) systems. These overlap-based metrics primarily capture surface lexical similarity rather than deeper dimensions such as grounding, factuality, completeness, or clinical relevance. Recent critiques emphasize that no single metric consistently correlates with human judgment across diverse generative tasks.¹ In the context of RAG, Zhu et al.² demonstrate how traditional methods fail to assess retrieval-generation alignment, a concern echoed in a comprehensive survey of RAG evaluation approaches³ that underscores the need for multi-dimensional evaluation strategies reflecting real-world performance. Beyond RAG, Weidinger et al.⁴ call for an “evaluation science” that prioritizes construct validity and transparency over metric benchmarking. This work, among others, signals a shift from metric-driven benchmarking toward meaning-driven evaluation, better aligned with the complex reasoning tasks RAG systems are designed to support.
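To make the limitation concrete, the sketch below computes a ROUGE-1-style unigram-overlap F1 in plain Python on a hypothetical clinical example. A response that negates the reference scores far higher than a faithful paraphrase, because overlap rewards shared words, not grounding.

```python
import re
from collections import Counter

def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))

def unigram_f1(reference: str, candidate: str) -> float:
    """ROUGE-1-style score: F1 over shared unigrams (surface overlap only)."""
    ref, cand = tokens(reference), tokens(candidate)
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference answer drawn from a trusted source.
reference = ("Hypoglossal nerve stimulation is an implanted treatment "
             "for obstructive sleep apnea")

# Faithful paraphrase: same meaning, mostly different words.
faithful = ("An implantable device therapy indicated for patients "
            "with obstructive sleep apnea")

# Ungrounded answer: reuses the reference's words but inverts the claim.
ungrounded = ("Hypoglossal nerve stimulation is not an implanted "
              "treatment for obstructive sleep apnea")

print(f"faithful paraphrase: {unigram_f1(reference, faithful):.2f}")    # ~0.45
print(f"negated (wrong):     {unigram_f1(reference, ungrounded):.2f}")  # ~0.96
```

A dangerously wrong answer beating a correct one by roughly two to one is exactly the failure mode that motivates multi-dimensional evaluation.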
Following this line of thought, at Clinia we have designed our own set of medical and linguistic criteria to evaluate responses from both the content perspective (Is the information correct, complete, and safe?) and the form perspective (Is it clear, respectful, and useful to the reader?).
These guidelines ensure the semantic validity and scientific robustness of model outputs, which are the foundation of trustworthiness and patient safety in health care AI.
In health care, even slightly off-topic content can waste time or mislead. Therefore, responses must directly address the user's query. We evaluate relevance on a graded scale:
✅ Relevant – The response answers the question fully and precisely.
🟡 Contextually related – The response relates to the topic but doesn’t address the exact question.
❌ Irrelevant – The content is off-topic or misleading.
Let’s consider as an example the question “What is hypoglossal nerve stimulation?” and the following three responses:
✅ Hypoglossal nerve stimulation is a medical treatment used for obstructive sleep apnea (OSA). It involves the use of an implanted device that stimulates the hypoglossal nerve, which controls tongue movement. By stimulating this nerve during sleep, the device helps to keep the airway open, reducing apneic events and improving breathing.
🟡 Obstructive sleep apnea is a condition where the airway becomes blocked during sleep, causing breathing interruptions. Various treatments exist for this condition, including lifestyle changes, hypoglossal nerve stimulation, CPAP machines, and surgical options.
❌ The amygdala is a small, almond-shaped cluster of nuclei located deep within the temporal lobes of the brain.
This example highlights how relevance is not just about correctness, but about usefulness in context. A response that goes straight to the point enables faster, safer decision-making, while tangential or unrelated content risks distracting the user from what truly matters.
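One common way to operationalize a graded scale like this is an LLM-as-judge: a grader model receives the question, the response, and the scale definitions, and returns a single label. The sketch below is illustrative only; the `Relevance` enum, the `JUDGE_PROMPT`, and the `ask_llm` callable are assumptions for the example, not Clinia’s production grader.

```python
from enum import Enum
from typing import Callable

class Relevance(Enum):
    RELEVANT = "relevant"             # ✅ answers the question fully and precisely
    CONTEXTUALLY_RELATED = "related"  # 🟡 on-topic but misses the exact question
    IRRELEVANT = "irrelevant"         # ❌ off-topic or misleading

JUDGE_PROMPT = """You are grading a medical Q&A response for relevance.

Question: {question}
Response: {response}

Reply with exactly one word:
- relevant: the response answers the question fully and precisely
- related: the response is on the topic but does not address the exact question
- irrelevant: the response is off-topic or misleading"""

def grade_relevance(question: str, response: str,
                    ask_llm: Callable[[str], str]) -> Relevance:
    """ask_llm is any callable that sends a prompt to a model and returns
    its text reply (hypothetical; bring your own client)."""
    label = ask_llm(JUDGE_PROMPT.format(question=question, response=response))
    return Relevance(label.strip().lower())
```

Run on the three example responses above, such a judge should return relevant, related, and irrelevant respectively; in practice, the judge itself must be validated against human annotations before its labels can be trusted.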