Overview
This repository contains a benchmarking system for evaluating multiple AI models on medical coding examinations. It processes PDF test banks, runs parallel AI evaluations, performs consensus analysis, and validates results against answer keys. The system supports both vanilla testing and enhanced testing with medical code embeddings fetched from government APIs.
Workflow
- PDF Extraction: Extract questions from PDF test banks using pdf_parser.py
- Medical Code Enhancement: Fetch real medical code descriptions from government APIs (see the sketch after this list)
- Parallel Testing: Run multiple AI models concurrently on the test questions
- Consensus Analysis: Perform multi-round voting to establish consensus answers
- Validation: Compare consensus results against the official answer key
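The code-enhancement step is the least self-explanatory part of this workflow. The README does not name the specific government APIs it queries, so the sketch below assumes the NLM Clinical Tables ICD-10-CM search endpoint; the function name and return handling are illustrative, not the repository's actual code.

```python
import requests

# Assumption: the "government API" is the NLM Clinical Tables ICD-10-CM search endpoint.
NLM_ICD10_URL = "https://clinicaltables.nlm.nih.gov/api/icd10cm/v3/search"

def fetch_icd10_description(code: str) -> str | None:
    """Look up the official description for an ICD-10-CM code (illustrative helper)."""
    resp = requests.get(
        NLM_ICD10_URL,
        params={"sf": "code", "terms": code, "df": "code,name"},
        timeout=10,
    )
    resp.raise_for_status()
    # Response shape: [total_count, codes, extra_fields, display_rows]
    total, _codes, _extra, display = resp.json()
    if total == 0:
        return None
    # Each display row is [code, name]; return the name of the first match.
    return display[0][1]

if __name__ == "__main__":
    # Example: enrich a question's answer options with the official description.
    print(fetch_icd10_description("E11.9"))
```

Fetched descriptions would be appended to the question prompt in the enhanced-testing mode, giving models the same code context a human candidate could look up.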
Supported AI Models
- OpenAI: GPT-4.1, GPT-4o, GPT-4.1-mini, GPT-4o-mini
- Anthropic: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude Sonnet 4
- Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Flash Preview
- Mistral: Mistral Medium
- DeepSeek: DeepSeek v3
- xAI: Grok 4
Parallelism
- Question-level: Process multiple questions concurrently per model
- Model-level: Test multiple AI models simultaneously
- Configurable workers: Adjust parallelism via --workers and --max-concurrent-agents (see the sketch below)
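The two levels compose naturally as nested concurrency limits. The following is a minimal sketch assuming an asyncio-based runner; only the flag names (--workers, --max-concurrent-agents) come from this README, and call_model and the surrounding function names are placeholders rather than the repository's actual implementation.

```python
import asyncio

async def call_model(model: str, question: dict) -> str:
    """Placeholder for a real provider call (OpenAI, Anthropic, Google, ...)."""
    await asyncio.sleep(0.1)
    return "A"

async def answer_question(model: str, question: dict, worker_sem: asyncio.Semaphore) -> str:
    async with worker_sem:                       # question-level cap (--workers)
        return await call_model(model, question)

async def run_model(model: str, questions: list[dict], workers: int) -> list[str]:
    worker_sem = asyncio.Semaphore(workers)
    return list(await asyncio.gather(
        *(answer_question(model, q, worker_sem) for q in questions)
    ))

async def run_benchmark(models: list[str], questions: list[dict],
                        workers: int, max_concurrent_agents: int) -> dict[str, list[str]]:
    agent_sem = asyncio.Semaphore(max_concurrent_agents)  # model-level cap (--max-concurrent-agents)

    async def run_one(model: str) -> tuple[str, list[str]]:
        async with agent_sem:
            return model, await run_model(model, questions, workers)

    return dict(await asyncio.gather(*(run_one(m) for m in models)))

if __name__ == "__main__":
    questions = [{"id": i, "stem": "..."} for i in range(5)]
    answers = asyncio.run(run_benchmark(["gpt-4o", "claude"], questions,
                                        workers=3, max_concurrent_agents=2))
    print(answers)
```

With this structure, --workers bounds in-flight requests per model and --max-concurrent-agents bounds how many models run at once, so total concurrency is at most their product.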
Consensus Strategy