When we ask an AI system a question, it can sometimes give an answer that sounds convincing but isn’t backed up by the right information. In everyday situations, this might not matter much, but in health care—where accuracy and trust are essential—the stakes are much higher.
One promising approach to reduce this risk is called Retrieval-Augmented Generation (RAG). Instead of relying only on what it has memorized during training, a RAG system first searches trusted sources—like medical guidelines or scientific articles—and then uses that information to generate its response. In other words, it doesn’t just invent an answer; it looks things up before speaking.
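For readers who like to see the idea in code, here is a minimal sketch of that retrieve-then-generate loop. The helpers `search_trusted_sources` and `generate_answer` are hypothetical stand-ins (a toy keyword retriever and a placeholder for a language model), not Clinia's implementation; a production system would use a proper search index over curated medical sources and a grounded generation step.

```python
from typing import List


def search_trusted_sources(question: str, corpus: List[dict], top_k: int = 3) -> List[dict]:
    """Toy retriever: rank documents by how many question words they share.
    A real system would query a vector or keyword index over medical guidelines."""
    question_words = set(question.lower().split())
    scored = [
        (len(question_words & set(doc["text"].lower().split())), doc)
        for doc in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def generate_answer(question: str, passages: List[dict]) -> str:
    """Placeholder for the generation step: a real system would pass the question
    and the retrieved passages to a language model as grounding context."""
    sources = "; ".join(doc["source"] for doc in passages)
    return f"Answer to '{question}', grounded in: {sources}"


# Usage: the answer is produced only after relevant passages have been retrieved.
corpus = [
    {"source": "hypertension_guideline_2023", "text": "first line treatment of hypertension ..."},
    {"source": "diabetes_review_2022", "text": "management of type 2 diabetes ..."},
]
question = "What is the first line treatment of hypertension?"
passages = search_trusted_sources(question, corpus)
print(generate_answer(question, passages))
```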
This design makes RAG especially valuable in health care, where knowledge is constantly evolving and every answer must be both reliable and safe. But it also highlights why careful evaluation is necessary: we need to ensure the system retrieves the right sources, interprets them correctly, and communicates its conclusions in a way health professionals can trust. Evaluating RAG in this context is about much more than performance—it’s about ensuring technology truly supports better decisions and, ultimately, better care.
Most standard AI evaluation methods come from older natural language processing (NLP) research. Metrics like BLEU or ROUGE compare the AI’s answer to a reference answer by counting word overlaps. These metrics work reasonably well for tasks like translation or summarization, but they quickly show their limits for generative AI in health care (Novikova et al., 2017).
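To make that limitation concrete, the sketch below computes a simple unigram-overlap score by hand (the idea behind ROUGE-1 recall). The example sentences are invented, and real BLEU/ROUGE implementations add n-grams and length penalties, but the core problem is the same: a clinically dangerous answer can overlap with the reference just as much as a correct one.

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference words that also appear in the candidate
    (a simplified, ROUGE-1-recall-style score)."""
    cand_words = set(candidate.lower().split())
    ref_words = reference.lower().split()
    if not ref_words:
        return 0.0
    return sum(word in cand_words for word in ref_words) / len(ref_words)


reference = "aspirin is not recommended for children with viral infections"
correct = "aspirin is not recommended for children who have viral infections"
dangerous = "aspirin is recommended for children with viral infections"

print(unigram_overlap(correct, reference))    # ~0.89: high score, as expected
print(unigram_overlap(dangerous, reference))  # ~0.89: same score, opposite and unsafe meaning
```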
For this reason, we need a different approach. At Clinia, we have designed our own set of medical and linguistic criteria to evaluate responses from both the content perspective (Is the information correct, complete, and safe?) and the form perspective (Is it clear, respectful, and useful to the reader?). This ensures that our evaluation reflects trustworthiness and patient safety.
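As an illustration only, such an evaluation can be organized as a structured checklist in which every response is scored on content and on form before it is considered acceptable. The criterion names below are generic examples chosen for the sketch, not Clinia's internal rubric.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ResponseEvaluation:
    """Scores a generated answer on content (what is said) and form (how it is said).
    Each criterion is rated from 0 (fails) to 2 (fully satisfied) by a reviewer."""
    content: Dict[str, int] = field(default_factory=lambda: {
        "factually_correct": 0,   # consistent with the retrieved sources
        "complete": 0,            # covers the clinically relevant points
        "safe": 0,                # no harmful or misleading advice
    })
    form: Dict[str, int] = field(default_factory=lambda: {
        "clear": 0,               # understandable by the intended reader
        "respectful": 0,          # appropriate tone for a clinical setting
        "useful": 0,              # actionable for the health professional
    })

    def passes(self, threshold: int = 2) -> bool:
        """A response is acceptable only if every criterion meets the threshold."""
        all_scores = list(self.content.values()) + list(self.form.values())
        return all(score >= threshold for score in all_scores)


# Usage: an answer that is clear and correct but unsafe should not pass.
evaluation = ResponseEvaluation()
evaluation.content.update(factually_correct=2, complete=2, safe=0)
evaluation.form.update(clear=2, respectful=2, useful=2)
print(evaluation.passes())  # False: the safety criterion blocks it
```

One design choice worth noting: scoring content and form separately makes failures easy to diagnose, since a response can be factually sound yet poorly communicated, or fluent yet unsafe.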
These guidelines ensure the semantic validity and scientific robustness of model outputs, which are the foundation of trustworthiness in health care AI.