Dataset: MATH500 (the same 500-problem test subset used in "Let's Verify Step by Step")
Models:
model_seqs = [
    [
        "deepseek-ai/deepseek-math-7b-base",
        "deepseek-ai/deepseek-math-7b-instruct",
        "deepseek-ai/deepseek-math-7b-rl",
    ],
    [
        "mistralai/Mistral-7B-v0.1",
        "peiyi9979/mistral-7b-sft",
        "peiyi9979/math-shepherd-mistral-7b-rl",
    ],
]
ensemble_type meanings:
- avg: mean pass ratio (#correct samples / #all samples for the same problem) over all test problems (similar to Pass@1)
- maj: pass rate by majority voting over the sampled answers
- any: standard Pass@k (a problem counts as solved if any sample is correct)

🔑 Note: Sampling 64 completions per prompt on MATH500 with DeepSeekMath-7B-RL using vLLM on one A800-PCIe (80GB) takes ~1.5 hr (~170 ms/sample).
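A minimal sketch of the three ensemble_type aggregations. The function name and the data layout (each problem reduced to a list of `(extracted_answer, is_correct)` pairs, one per sampled completion) are assumptions for illustration, not from the source:

```python
from collections import Counter

def aggregate(problems):
    """problems: list where each entry is a list of (answer, is_correct)
    pairs, one pair per sampled completion for that problem."""
    avg = maj = any_ = 0.0
    for samples in problems:
        answers = [a for a, _ in samples]
        correct = [ok for _, ok in samples]
        # avg: this problem's pass ratio (similar to Pass@1)
        avg += sum(correct) / len(samples)
        # maj: does the most frequent extracted answer happen to be correct?
        top = Counter(answers).most_common(1)[0][0]
        maj += any(ok for a, ok in samples if a == top)
        # any: standard Pass@k -- at least one correct sample
        any_ += any(correct)
    n = len(problems)
    return avg / n, maj / n, any_ / n
```

With 64 samples per prompt, `maj` typically falls between `avg` and `any_`, since majority voting filters out minority wrong answers but cannot recover problems where the most common answer is wrong.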
Dataset: MATH500
Framework: vLLM
Device: A800-PCIe (80GB) × 1
#Shots: 1 for base models, 0 for instruction-tuned models
Parameters:
    max_new_tokens: 2048
    gpu_memory_utilization: 0.85
    temperature: 0.7
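As a sanity check on the throughput quoted in the note above (~170 ms/sample), the total wall-clock time can be recomputed from the run settings; the helper below is illustrative, not from the source:

```python
def est_hours(num_prompts, samples_per_prompt, ms_per_sample):
    """Estimated wall-clock hours for a sampling run:
    total samples * per-sample latency."""
    total_samples = num_prompts * samples_per_prompt
    return total_samples * ms_per_sample / 1000 / 3600

# 500 MATH500 problems x 64 samples at 170 ms/sample
print(round(est_hours(500, 64, 170), 2))  # -> 1.51 (hours)
```

This matches the ~1.5 hr figure reported for DeepSeekMath-7B-RL on a single A800.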