Conclusion
Papers with Code - GSM8K Benchmark (Arithmetic Reasoning)

WizardMath 7B v1.1 scores 54.9 on GSM8K in English, but drops to 18.4 when the problems are given in Japanese.

https://arxiv.org/pdf/2308.09583v1.pdf
Evolutionary Optimization of Model Merging Recipes

| Model | GSM8K (greedy) | MATH (greedy) | GSM8K (majority@50) | MATH (majority@50) |
|---|---|---|---|---|
| OpenMath-CodeLlama-7B (nemo \| HF) | 75.9 | 43.6 | 84.8 | |
| OpenMath-Mistral-7B (nemo \| HF) | 80.2 | 44.5 | 86.9 | |
| OpenMath-CodeLlama-13B (nemo \| HF) | 78.8 | 45.5 | 86.8 | |
| OpenMath-CodeLlama-34B (nemo \| HF) | 80.7 | 48.3 | 88.0 | |
| OpenMath-Llama2-70B (nemo \| HF) | 84.7 | 46.3 | 90.1 | |
| OpenMath-CodeLlama-70B (nemo \| HF) | 84.6 | 50.7 | 90.8 | |
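
The table reports two decoding settings, greedy and majority@50. As a rough illustration of how they differ, the sketch below assumes that majority@50 means sampling 50 completions per problem and keeping the most frequent final answer, while greedy uses a single deterministic completion. The function names `majority_vote` and `accuracy` are hypothetical and do not come from any of the cited projects.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled completions.

    Assumption: each element is the final answer string already extracted
    from one sampled completion; None marks a completion with no answer.
    """
    counts = Counter(a.strip() for a in answers if a is not None)
    if not counts:
        return None
    # most_common(1) returns a list holding the single (answer, count) pair
    return counts.most_common(1)[0][0]


def accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answers."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Toy example with 5 samples instead of 50
samples = ["18", "18", "20", None, "18"]
print(majority_vote(samples))
```

Under this reading, voting can recover the correct answer even when individual samples disagree, which would be consistent with the majority@50 scores in the table being higher than the greedy ones.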