Conclusion

Papers with Code - GSM8K Benchmark (Arithmetic Reasoning)

WizardMath 7B v1.1 scores 54.9 on the English GSM8K, but the score drops to 18.4 when the problems are given in Japanese.

https://arxiv.org/pdf/2308.09583v1.pdf

Evolutionary Optimization of Model Merging Recipes

| model | greedy GSM8K | greedy MATH | majority@50 GSM8K | majority@50 MATH |
| --- | --- | --- | --- | --- |
| OpenMath-CodeLlama-7B (https://huggingface.co/nvidia/OpenMath-CodeLlama-7b-Python https://huggingface.co/nvidia/OpenMath-CodeLlama-7b-Python-hf) | 75.9 | 43.6 | 84.8 | |
| OpenMath-Mistral-7B (https://huggingface.co/nvidia/OpenMath-Mistral-7B-v0.1 https://huggingface.co/nvidia/OpenMath-Mistral-7B-v0.1-hf) | 80.2 | 44.5 | 86.9 | |
| OpenMath-CodeLlama-13B (https://huggingface.co/nvidia/OpenMath-CodeLlama-13b-Python https://huggingface.co/nvidia/OpenMath-CodeLlama-13b-Python-hf) | 78.8 | 45.5 | 86.8 | |
| OpenMath-CodeLlama-34B (https://huggingface.co/nvidia/OpenMath-CodeLlama-34b-Python https://huggingface.co/nvidia/OpenMath-CodeLlama-34b-Python-hf) | 80.7 | 48.3 | 88.0 | |
| OpenMath-Llama2-70B (https://huggingface.co/nvidia/OpenMath-Llama-2-70b https://huggingface.co/nvidia/OpenMath-Llama-2-70b-hf) | 84.7 | 46.3 | 90.1 | |
| OpenMath-CodeLlama-70B (https://huggingface.co/nvidia/OpenMath-CodeLlama-70b-Python https://huggingface.co/nvidia/OpenMath-CodeLlama-70b-Python-hf) | 84.6 | 50.7 | 90.8 | |
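
The table reports both "greedy" and "majority@50" scores. As a rough illustration of the difference, the sketch below assumes that majority@50 means sampling 50 completions per problem and keeping the most frequent final answer, while greedy uses a single deterministic completion. The helper names `majority_vote` and `accuracy`, and the sample answers, are hypothetical and not taken from any of the linked repositories.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-None answer (ties go to the answer seen first)."""
    # Drop samples where no final answer could be extracted.
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answers."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical final answers extracted from 5 sampled completions of one problem
# (a real majority@50 run would use 50 samples per problem).
samples = ["18", "18", "20", None, "18"]
voted = majority_vote(samples)
print(voted)
print(accuracy([voted], ["18"]))
```

Voting over many samples tends to smooth out occasional reasoning slips, which would be consistent with the majority@50 GSM8K numbers in the table being higher than the greedy ones.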

Datasets