1 Codebase

https://github.com/wannabeyourfriend/weak2strong-generalization

2 Zero-shot Model Comparison

  1. max_completion_token was raised for the newer models because under the 512 setting many completions fail (the answer is cut off before it is produced).
  2. Observed that at temperature=1, scores fluctuate by about 2% across runs (test_size=200).
  3. Experiment A (weak → strong): gpt-4o-mini → gpt-5-mini (blue → orange).
  4. Experiment B (weak → strong): gpt-4o-mini → gpt-4.1-mini (blue → green).
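The ~2% run-to-run fluctuation in note 2 is on the order of plain sampling noise: the binomial standard error of an accuracy estimate at test_size=200 is about 3.5% for accuracies near 0.5-0.6. A minimal sanity check (our addition, assuming i.i.d. test items):

```python
import math

def accuracy_se(p, n):
    """Standard error of an accuracy estimate p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)

# At accuracy ~0.6 with 200 test items, one standard error is ~3.5%,
# so a ~2% fluctuation between runs is within sampling noise.
print(round(accuracy_se(0.6, 200), 3))  # → 0.035
```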
Model         Score  max_completion_token  temperature  test_size
gpt-4o-mini   0.45   512                   0            500
gpt-4o        0.48   512                   0            500
gpt-4.1-nano  0.50   512                   0            500
gpt-4.1-mini  0.54   512                   0            500
gpt-4.1       0.57   512                   0            500
gpt-5-nano    0.78   2048                  1            100
gpt-5-mini    0.84   2048                  1            100
gpt-5         0.78   2048                  1            100
o3-mini       0.76   2048                  1            100
o3            0.78   2048                  1            100
o4-mini       0.84   2048                  1            100
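The zero-shot scoring above can be sketched as follows. This is a minimal sketch, not the repo's implementation: `query_model` is a hypothetical wrapper around a chat-completions call that forwards the `max_completion_tokens` and `temperature` settings from the table, and exact-match accuracy is an assumption about the metric.

```python
def accuracy(predictions, golds):
    """Fraction of predictions that exactly match the gold labels."""
    assert len(predictions) == len(golds)
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def evaluate_zero_shot(query_model, dataset, max_completion_tokens, temperature):
    """Score a model on (question, gold) pairs under the table's settings."""
    preds = [query_model(q, max_completion_tokens=max_completion_tokens,
                         temperature=temperature)
             for q, _ in dataset]
    return accuracy(preds, [g for _, g in dataset])

# Toy usage with a stub model (no API calls): 3 of 4 answers match.
stub = lambda q, **kw: "A"
data = [("q1", "A"), ("q2", "B"), ("q3", "A"), ("q4", "A")]
print(evaluate_zero_shot(stub, data, max_completion_tokens=512, temperature=0))  # → 0.75
```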

3 Main Experiment Results

Few-shot examples are generated at temperature=1, sampled from the training set (a 500-example subset); max_completion_token is aligned with the main experiment setting.

Experiment A

Setting: gpt-4o-mini → gpt-5-mini, temperature=1, max_completion_token=512, test_size=200

Few-shot nums  0      1      2      5      10     20
Weak + Gold    0.455  0.520  0.430  0.325  0.210  0.195
Strong + Weak  0.570  0.595  0.585  0.615  0.650  0.660
Strong + Gold  0.555  0.565  0.545  0.640  0.655  0.700
PGR            N/A    1.667  1.348  0.921  0.989  0.921

[figure: model_performance_chart_a_1.png]
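The PGR rows can be reproduced from the other three rows of each column: PGR is the fraction of the weak-to-strong performance gap that the weak-supervised strong model recovers. A minimal sketch (our naming; values below are the 1-shot column of the 512 table):

```python
def pgr(weak_gold, strong_weak, strong_gold):
    """Performance gap recovered:
    (strong+weak - weak+gold) / (strong+gold - weak+gold)."""
    return (strong_weak - weak_gold) / (strong_gold - weak_gold)

# 1-shot column of the max_completion_token=512 table:
# weak+gold=0.520, strong+weak=0.595, strong+gold=0.565
print(round(pgr(0.520, 0.595, 0.565), 3))  # → 1.667
```

A PGR above 1 means the strong model trained on weak labels outperformed the same model trained on gold labels for that column.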

Setting: gpt-4o-mini → gpt-5-mini, temperature=1, max_completion_token=2048, test_size=200

Few-shot nums  0      1      2      5      10     20
Weak + Gold    0.670  0.635  0.685  0.475  0.350  0.270
Strong + Weak  0.820  0.845  0.875  0.855  0.880  0.910
Strong + Gold  0.840  0.870  0.855  0.910  0.880  0.895
PGR            N/A    0.894  1.118  0.874  1.000  1.024

[figure: model_performance_chart_a_2.png]

Experiment B

Setting: gpt-4o-mini → gpt-4.1-mini, temperature=1, max_completion_token=512, test_size=200