https://github.com/wannabeyourfriend/weak2strong-generalization
| Model | Score | max_completion_token | temperature | test_size |
|---|---|---|---|---|
| gpt-4o-mini | 0.45 | 512 | 0 | 500 |
| gpt-4o | 0.48 | 512 | 0 | 500 |
| gpt-4.1-nano | 0.50 | 512 | 0 | 500 |
| gpt-4.1-mini | 0.54 | 512 | 0 | 500 |
| gpt-4.1 | 0.57 | 512 | 0 | 500 |
| gpt-5-nano | 0.78 | 2048 | 1 | 100 |
| gpt-5-mini | 0.84 | 2048 | 1 | 100 |
| gpt-5 | 0.78 | 2048 | 1 | 100 |
| o3-mini | 0.76 | 2048 | 1 | 100 |
| o3 | 0.78 | 2048 | 1 | 100 |
| o4-mini | 0.84 | 2048 | 1 | 100 |

Few-shot examples are generated at temperature=1 from samples drawn from a 500-example subset of the training set; max_completion_token matches the main experiment setting.

Setting: gpt-4o-mini → gpt-5-mini, temperature=1, max_completion_token=512, test_size=200
| Number of few-shot examples | 0 | 1 | 2 | 5 | 10 | 20 |
|---|---|---|---|---|---|---|
| Weak + Gold | 0.455 | 0.520 | 0.430 | 0.325 | 0.210 | 0.195 |
| Strong + Weak | 0.570 | 0.595 | 0.585 | 0.615 | 0.650 | 0.660 |
| Strong + Gold | 0.555 | 0.565 | 0.545 | 0.640 | 0.655 | 0.700 |
| PGR | N/A | 1.667 | 1.348 | 0.921 | 0.989 | 0.921 |
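
The PGR row above is consistent with the usual "performance gap recovered" ratio, (Strong + Weak − Weak + Gold) / (Strong + Gold − Weak + Gold). The following is a minimal sketch under that assumption; the function name `pgr` and the variable names are illustrative and are not taken from the repository. It recomputes the row from the three accuracy rows of the 512-token table. The 0-shot column is skipped because the table lists it as N/A.

```python
def pgr(weak_gold, strong_weak, strong_gold):
    """Assumed PGR definition: fraction of the weak-to-strong gap recovered.

    Returns None when the gap between the strong ceiling and the weak
    baseline is zero, since the ratio is undefined in that case.
    """
    gap = strong_gold - weak_gold
    if gap == 0:
        return None
    return (strong_weak - weak_gold) / gap


# Accuracies copied from the table above (gpt-4o-mini -> gpt-5-mini,
# max_completion_token=512), for 1, 2, 5, 10, and 20 few-shot examples.
shots = [1, 2, 5, 10, 20]
weak_gold = [0.520, 0.430, 0.325, 0.210, 0.195]
strong_weak = [0.595, 0.585, 0.615, 0.650, 0.660]
strong_gold = [0.565, 0.545, 0.640, 0.655, 0.700]

pgr_by_shot = {
    k: round(pgr(w, sw, sg), 3)
    for k, w, sw, sg in zip(shots, weak_gold, strong_weak, strong_gold)
}

for k, value in pgr_by_shot.items():
    print(f"{k:>2}-shot PGR: {value:.3f}")
```

A PGR above 1 means the strong model prompted with weak labels scored higher than the same model prompted with gold labels at that shot count. With test_size=200, a gap of that size may be within sampling noise.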

Setting: gpt-4o-mini → gpt-5-mini, temperature=1, max_completion_token=2048, test_size=200
| Number of few-shot examples | 0 | 1 | 2 | 5 | 10 | 20 |
|---|---|---|---|---|---|---|
| Weak + Gold | 0.67 | 0.635 | 0.685 | 0.475 | 0.350 | 0.270 |
| Strong + Weak | 0.82 | 0.845 | 0.875 | 0.855 | 0.880 | 0.910 |
| Strong + Gold | 0.84 | 0.870 | 0.855 | 0.910 | 0.880 | 0.895 |
| PGR | N/A | 0.894 | 1.118 | 0.874 | 1.000 | 1.024 |

Setting: gpt-4o-mini → gpt-4.1-mini, temperature=1, max_completion_token=512, test_size=200