Overview
This repository contains a benchmarking system for evaluating multiple AI models on medical coding examinations. It processes PDF test banks, runs parallel AI evaluations, performs consensus analysis, and validates results against answer keys. The system supports both vanilla testing and enhanced testing with medical code embeddings fetched from government APIs.
Workflow
- PDF Extraction: Extract questions from PDF test banks using pdf_parser.py
- Medical Code Enhancement: Fetch real medical code descriptions from government APIs (see the sketch after this list)
- Parallel Testing: Run multiple AI models concurrently on the test questions
- Consensus Analysis: Perform multi-round voting to establish consensus answers
- Validation: Compare consensus results against the official answer key
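The code-enhancement step is the least self-explanatory part of this workflow. The README does not name the specific government APIs it queries, so the sketch below assumes the NLM Clinical Tables ICD-10-CM search endpoint; the function name and return handling are illustrative, not the repository's actual code.

```python
import requests

# Assumption: the "government API" is the NLM Clinical Tables ICD-10-CM search endpoint.
NLM_ICD10_URL = "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search"

def fetch_icd10_description(code: str) -> str | None:
    """Look up the official description for an ICD-10-CM code (illustrative helper)."""
    resp = requests.get(
        NLM_ICD10_URL,
        params={"sf": "code", "terms": code, "df": "code,name"},
        timeout=10,
    )
    resp.raise_for_status()
    # Response shape: [total_count, codes, extra_fields, display_rows]
    total, _codes, _extra, display = resp.json()
    if total == 0:
        return None
    # Each display row is [code, name]; return the name of the first match.
    return display[0][1]

if __name__ == "__main__":
    # Example: enrich a question's answer options with the official description.
    print(fetch_icd10_description("E11.9"))
```

Fetched descriptions would be appended to the question prompt in the enhanced-testing mode, giving models the same code context a human candidate could look up.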
Supported AI Models
- OpenAI: GPT-4.1, GPT-4o, GPT-4.1-mini, GPT-4o-mini
- Anthropic: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4
- Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash Preview
- Mistral: Mistral Medium
- DeepSeek: DeepSeek v3
- xAI: Grok 4
Parallelism
- Question-level: Process multiple questions concurrently per model
- Model-level: Test multiple AI models simultaneously
- Configurable workers: Adjust parallelism via --workers and --max-concurrent-agents (see the sketch below)
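The two levels compose naturally as nested concurrency limits. The following is a minimal sketch assuming an asyncio-based runner; only the flag names (--workers, --max-concurrent-agents) come from this README, and call_model and the surrounding function names are placeholders rather than the repository's actual implementation.

```python
import asyncio

async def call_model(model: str, question: dict) -> str:
    """Placeholder for a real provider call (OpenAI, Anthropic, Google, ...)."""
    await asyncio.sleep(0.1)
    return "A"

async def answer_question(model: str, question: dict, worker_sem: asyncio.Semaphore) -> str:
    async with worker_sem:                       # question-level cap (--workers)
        return await call_model(model, question)

async def run_model(model: str, questions: list[dict], workers: int) -> list[str]:
    worker_sem = asyncio.Semaphore(workers)
    return list(await asyncio.gather(
        *(answer_question(model, q, worker_sem) for q in questions)
    ))

async def run_benchmark(models: list[str], questions: list[dict],
                        workers: int, max_concurrent_agents: int) -> dict[str, list[str]]:
    agent_sem = asyncio.Semaphore(max_concurrent_agents)  # model-level cap (--max-concurrent-agents)

    async def run_one(model: str) -> tuple[str, list[str]]:
        async with agent_sem:
            return model, await run_model(model, questions, workers)

    return dict(await asyncio.gather(*(run_one(m) for m in models)))

if __name__ == "__main__":
    questions = [{"id": i, "stem": "..."} for i in range(5)]
    answers = asyncio.run(run_benchmark(["gpt-4o", "claude"], questions,
                                        workers=3, max_concurrent_agents=2))
    print(answers)
```

With this structure, --workers bounds in-flight requests per model and --max-concurrent-agents bounds how many models run at once, so total concurrency is at most their product.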
Consensus Strategy