Abstract

Drug discovery remains a highly resource-intensive process, often constrained by the prohibitive costs of developing vaccines and therapeutics with limited economic value. During the Aiffel program at Modulabs, I participated in a collaborative project with a startup applying Large Language Models (LLMs) to accelerate structure–activity relationship (SAR) analysis and report generation. My primary role involved designing evaluation metrics and methodologies to assess the algorithms and applications developed by the team. This blog series summarizes the technical concepts, case studies, and evaluation frameworks encountered throughout the project, with a focus on how LLMs can support drug discovery by reducing opportunity costs and enhancing reproducibility.

Keywords

Drug Discovery, Structure–Activity Relationship (SAR), Activity Cliff, RDKit, SMILES, Fingerprints, Large Language Models (LLMs), Evaluation Metrics, Hallucination Mitigation, HealthBench, PharmaSwarm, RAG, RAGAS

Prologue

During the final stages of the COVID-19 pandemic, global health initiatives such as GAVI and The Global Fund gained increasing prominence in the governance of international health. At that time, I was deciding whether to enroll in a course on Pharmaceutical Economics and Policy. Encouraged by the professor, I audited both the lectures and seminars, eventually completing the course. The professor’s main research fields were Pharmaceutical R&D and Intellectual Property (IP), which allowed me to expand my knowledge of high-cost vaccine programs, drug development for tropical and rare diseases, the drug discovery and development cycle, and the regulatory systems of the United States and Europe. A key insight I gained was that drug development incurs substantial costs, which often leads to the abandonment of vaccines or therapies considered to have limited economic value once opportunity costs are taken into account.

Building on this foundation, I joined a project in the Aiffel program that leveraged LLMs for drug discovery and development. The project used real-world data from a startup actively experimenting with AI systems for automated SAR analysis. My motivation was to understand how such systems operate in practice and how they might alleviate opportunity costs in drug development.

In the project, my role was to evaluate the algorithms and applications produced by the team, which included designing evaluation metrics and validation methodologies. During the initial 1–2 weeks, I focused on acquiring domain-specific knowledge; in the mid-phase, I worked on the technical challenges of implementing evaluation metrics for LLM outputs. Currently, I am preparing to apply these metrics and code to actual model runs.

This blog series documents that journey. It covers the fundamental concepts of drug development, representative Activity Cliff cases, hallucination mitigation strategies in LLMs, and a review of existing evaluation frameworks such as OpenAI HealthBench and PharmaSwarm. It also includes experiments with RAG and RAGAS, concluding with a discussion on applying and tuning evaluation metrics in practice.


[Series Outline]

  1. Drug Development, RDKit, SMILES Code, and Fingerprints
  2. A Representative Activity Cliff Case – (R)-thalidomide vs. (S)-thalidomide
  3. Reducing LLM Hallucinations – Few-Shot Prompting, Qdrant, and Prompt Engineering