Overview

This repository contains a benchmarking system that evaluates multiple AI models on medical coding examinations. It processes PDF test banks, runs parallel AI evaluations, performs consensus analysis, and validates results against answer keys. The system supports both vanilla testing and enhanced testing with medical code embeddings retrieved from government APIs.
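The pipeline described above (parallel model evaluation, consensus voting, and validation against an answer key) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the model identifiers, the `ask_model` stub, and the `evaluate` function are all hypothetical names invented for this example.

```python
# Hypothetical sketch of the benchmark pipeline: query several models in
# parallel, take the majority (consensus) answer per question, and score
# it against the answer key. All names here are illustrative assumptions.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

MODELS = ["model-a", "model-b", "model-c"]  # placeholder model identifiers


def ask_model(model: str, question: str) -> str:
    """Stub: in the real system this would call the model's API."""
    return {"model-a": "B", "model-b": "B", "model-c": "C"}[model]


def evaluate(questions: dict[str, str], answer_key: dict[str, str]) -> float:
    """Run every model on every question in parallel, take the majority
    answer, and return the fraction that matches the answer key."""
    correct = 0
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        for qid, question in questions.items():
            # Fan the same question out to all models concurrently.
            answers = list(pool.map(lambda m: ask_model(m, question), MODELS))
            # Consensus = most common answer across models.
            consensus, _ = Counter(answers).most_common(1)[0]
            correct += consensus == answer_key[qid]
    return correct / len(questions)


accuracy = evaluate({"q1": "sample question"}, {"q1": "B"})
print(accuracy)  # consensus "B" matches the key, so 1.0
```

A real implementation would replace the stub with API calls, add retry and rate-limit handling, and persist per-model answers so the consensus step can be re-run offline.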

Workflow

Supported AI Models

Parallelism

Consensus Strategy