Conclusion

Papers with Code - GSM8K Benchmark (Arithmetic Reasoning)

WizardMath 7B v1.1 scores 54.9 on the English GSM8K, but drops to 18.4 when the problems are given in Japanese.

https://arxiv.org/pdf/2308.09583v1.pdf

Evolutionary Optimization of Model Merging Recipes

                                    greedy          majority@50
model                               GSM8K   MATH    GSM8K   MATH
OpenMath-CodeLlama-7B (nemo HF)     75.9    43.6    84.8
OpenMath-Mistral-7B (nemo HF)       80.2    44.5    86.9
OpenMath-CodeLlama-13B (nemo HF)    78.8    45.5    86.8
OpenMath-CodeLlama-34B (nemo HF)    80.7    48.3    88.0
OpenMath-Llama2-70B (nemo HF)       84.7    46.3    90.1
OpenMath-CodeLlama-70B (nemo HF)    84.6    50.7    90.8
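
The majority@50 columns presumably mean that 50 answers are sampled per problem and the most frequent final answer is scored, whereas greedy scores a single deterministic decode. A minimal sketch of that scoring, assuming the final answers have already been extracted as strings; the function names and the toy data are hypothetical and not taken from any of the sources above:

```python
from collections import Counter


def majority_vote(sampled_answers):
    """Return the most frequent answer among the samples for one problem."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) yields [(answer, count)]; ties resolve to the answer seen first.
    return Counter(sampled_answers).most_common(1)[0][0]


def accuracy(predictions, references):
    """Percentage of problems whose predicted answer matches the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Toy data, made up for illustration: 2 problems, 5 samples each instead of 50.
references = ["18", "9"]
greedy_predictions = ["18", "7"]  # one answer per problem from a single decode
samples = [["18", "18", "20", "18", "17"], ["7", "9", "9", "7", "9"]]

voted_predictions = [majority_vote(s) for s in samples]
greedy_score = accuracy(greedy_predictions, references)
voted_score = accuracy(voted_predictions, references)
```

On this toy data the vote recovers the second problem that the single decode got wrong, which is the kind of gain the majority@50 columns show over greedy.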

Datasets