Conclusion
Papers with Code - GSM8K Benchmark (Arithmetic Reasoning)
WizardMath 7B v1.1 scores 54.9 on the English GSM8K, but drops to 18.4 on the Japanese version.
https://arxiv.org/pdf/2308.09583v1.pdf
Evolutionary Optimization of Model Merging Recipes
model | GSM8K (greedy) | MATH (greedy) | GSM8K (majority@50) | MATH (majority@50) |
---|---|---|---|---|
OpenMath-CodeLlama-7B (https://huggingface.co/nvidia/OpenMath-CodeLlama-7b-Python, https://huggingface.co/nvidia/OpenMath-CodeLlama-7b-Python-hf) | 75.9 | 43.6 | 84.8 | |
OpenMath-Mistral-7B (https://huggingface.co/nvidia/OpenMath-Mistral-7B-v0.1, https://huggingface.co/nvidia/OpenMath-Mistral-7B-v0.1-hf) | 80.2 | 44.5 | 86.9 | |
OpenMath-CodeLlama-13B (https://huggingface.co/nvidia/OpenMath-CodeLlama-13b-Python, https://huggingface.co/nvidia/OpenMath-CodeLlama-13b-Python-hf) | 78.8 | 45.5 | 86.8 | |
OpenMath-CodeLlama-34B (https://huggingface.co/nvidia/OpenMath-CodeLlama-34b-Python, https://huggingface.co/nvidia/OpenMath-CodeLlama-34b-Python-hf) | 80.7 | 48.3 | 88.0 | |
OpenMath-Llama2-70B (https://huggingface.co/nvidia/OpenMath-Llama-2-70b, https://huggingface.co/nvidia/OpenMath-Llama-2-70b-hf) | 84.7 | 46.3 | 90.1 | |
OpenMath-CodeLlama-70B (https://huggingface.co/nvidia/OpenMath-CodeLlama-70b-Python, https://huggingface.co/nvidia/OpenMath-CodeLlama-70b-Python-hf) | 84.6 | 50.7 | 90.8 | |
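
The table reports two decoding settings, greedy and majority@50. As a rough illustration of the difference, the sketch below scores a toy set of problems both ways: once with a single answer per problem, and once by taking the most frequent answer among several sampled generations. Everything here is hypothetical: the helper names `majority_vote` and `accuracy`, the toy answers, and the use of 5 samples instead of 50. It is not the evaluation code behind the numbers above.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled generations."""
    counts = Counter(answers)
    # most_common(1) returns a one-element list of (answer, count) pairs.
    return counts.most_common(1)[0][0]


def accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answers."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical toy data: 3 problems with their reference answers.
references = ["18", "9", "42"]

# One answer per problem, standing in for a single greedy decode.
greedy_answers = ["18", "7", "40"]

# Five sampled answers per problem (majority@5 here; the table uses 50 samples).
sampled_answers = [
    ["18", "18", "20", "18", "17"],
    ["7", "9", "9", "9", "7"],
    ["40", "42", "42", "41", "42"],
]

voted_answers = [majority_vote(samples) for samples in sampled_answers]

print(f"greedy:     {accuracy(greedy_answers, references):.1f}")
print(f"majority@5: {accuracy(voted_answers, references):.1f}")
```

In this toy setup the vote recovers the two problems that the single decode gets wrong, which is the same direction as the gap between the greedy and majority@50 columns in the table.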