1 Codebase

https://github.com/wannabeyourfriend/weak2strong-generalization

2 Zero-shot Model Comparison

  1. max_completion_token was raised for the newer models because under the 512 setting many completions fail (the answer is cut off before it is produced).
  2. Observed that at temperature=1, scores fluctuate by about 2% across runs (test_size=200).
  3. Experiment A (weak → strong): gpt-4o-mini → gpt-5-mini (blue → orange).
  4. Experiment B (weak → strong): gpt-4o-mini → gpt-4.1-mini (blue → green).
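The ~2% run-to-run fluctuation in note 2 is on the order of plain sampling noise: the binomial standard error of an accuracy estimate at test_size=200 is about 3.5% for accuracies near 0.5-0.6. A minimal sanity check (our addition, assuming i.i.d. test items):

```python
import math

def accuracy_se(p, n):
    """Standard error of an accuracy estimate p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)

# At accuracy ~0.6 with 200 test items, one standard error is ~3.5%,
# so a ~2% fluctuation between runs is within sampling noise.
print(round(accuracy_se(0.6, 200), 3))  # → 0.035
```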
Model         Score  max_completion_token  temperature  test_size
gpt-4o-mini   0.45   512                   0            500
gpt-4o        0.48   512                   0            500
gpt-4.1-nano  0.50   512                   0            500
gpt-4.1-mini  0.54   512                   0            500
gpt-4.1       0.57   512                   0            500
gpt-5-nano    0.78   2048                  1            100
gpt-5-mini    0.84   2048                  1            100
gpt-5         0.78   2048                  1            100
o3-mini       0.76   2048                  1            100
o3            0.78   2048                  1            100
o4-mini       0.84   2048                  1            100
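The zero-shot scoring above can be sketched as follows. This is a minimal sketch, not the repo's implementation: `query_model` is a hypothetical wrapper around a chat-completions call that forwards the `max_completion_tokens` and `temperature` settings from the table, and exact-match accuracy is an assumption about the metric.

```python
def accuracy(predictions, golds):
    """Fraction of predictions that exactly match the gold labels."""
    assert len(predictions) == len(golds)
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def evaluate_zero_shot(query_model, dataset, max_completion_tokens, temperature):
    """Score a model on (question, gold) pairs under the table's settings."""
    preds = [query_model(q, max_completion_tokens=max_completion_tokens,
                         temperature=temperature)
             for q, _ in dataset]
    return accuracy(preds, [g for _, g in dataset])

# Toy usage with a stub model (no API calls): 3 of 4 answers match.
stub = lambda q, **kw: "A"
data = [("q1", "A"), ("q2", "B"), ("q3", "A"), ("q4", "A")]
print(evaluate_zero_shot(stub, data, max_completion_tokens=512, temperature=0))  # → 0.75
```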

3 Main Experiment Results

Few-shot examples are generated at temperature=1, sampled from the training set (a 500-example subset); max_completion_token is aligned with the main experiment setting.

Experiment A

Setting: gpt-4o-mini → gpt-5-mini, temperature=1, max_completion_token=512, test_size=200

Few-shot nums  0      1      2      5      10     20
Weak + Gold    0.455  0.520  0.430  0.325  0.210  0.195
Strong + Weak  0.570  0.595  0.585  0.615  0.650  0.660
Strong + Gold  0.555  0.565  0.545  0.640  0.655  0.700
PGR            N/A    1.667  1.348  0.921  0.989  0.921

[figure: model_performance_chart_a_1.png]
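The PGR rows can be reproduced from the other three rows of each column: PGR is the fraction of the weak-to-strong performance gap that the weak-supervised strong model recovers. A minimal sketch (our naming; values below are the 1-shot column of the 512 table):

```python
def pgr(weak_gold, strong_weak, strong_gold):
    """Performance gap recovered:
    (strong+weak - weak+gold) / (strong+gold - weak+gold)."""
    return (strong_weak - weak_gold) / (strong_gold - weak_gold)

# 1-shot column of the max_completion_token=512 table:
# weak+gold=0.520, strong+weak=0.595, strong+gold=0.565
print(round(pgr(0.520, 0.595, 0.565), 3))  # → 1.667
```

A PGR above 1 means the strong model trained on weak labels outperformed the same model trained on gold labels for that column.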

Setting: gpt-4o-mini → gpt-5-mini, temperature=1, max_completion_token=2048, test_size=200

Few-shot nums  0      1      2      5      10     20
Weak + Gold    0.670  0.635  0.685  0.475  0.350  0.270
Strong + Weak  0.820  0.845  0.875  0.855  0.880  0.910
Strong + Gold  0.840  0.870  0.855  0.910  0.880  0.895
PGR            N/A    0.894  1.118  0.874  1.000  1.024

[figure: model_performance_chart_a_2.png]

Experiment B

Setting: gpt-4o-mini → gpt-4.1-mini, temperature=1, max_completion_token=512, test_size=200