Speed test

| Quant Scheme | Observer | QuantizationModifier | GPTQModifier |
|---|---|---|---|
| fp8_dynamic_per_token | MinMax | 0.753–0.754 | |
| | MSE | 0.759–0.760 | |
| fp8_static_per_tensor | MinMax | 0.757–0.758 | |
| | MSE | 0.767–0.770 | |
| int8_w8a8_dynamic_per_token | MinMax | 0.760–0.761 | 0.769–0.771 |
| | MSE | 0.770–0.772 | 0.767–0.767 |
| w4a16_actorder_group | MinMax | | 0.726–0.726 |
| | MSE | | 0.712–0.712 |
| w4a16_actorder_weights | MinMax | | 0.721–0.722 |
| | MSE | | 0.717–0.720 |
| w4a16_grouped_quant | MinMax | 0.666–0.671 | 0.717–0.718 |
| | MSE | 0.657–0.659 | 0.723–0.724 |

AWQ results

MinMax:

| Task | Version | Filter | n-shot | Metric | Value |
|---|---|---|---|---|---|
| wikitext | 2 | none | 5 | bits_per_byte | 0.6291 |
| | | none | 5 | byte_perplexity | 1.5466 |
| | | none | 5 | word_perplexity | 10.2949 |

MSE:

| Task | Version | Filter | n-shot | Metric | Value |
|---|---|---|---|---|---|
| wikitext | 2 | none | 5 | bits_per_byte | 0.6323 |
| | | none | 5 | byte_perplexity | 1.5500 |
| | | none | 5 | word_perplexity | 10.4192 |
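
As a quick sanity check on the wikitext numbers above: bits_per_byte should be the base-2 log of byte_perplexity, so the two metrics carry the same information. A minimal sketch (the helper name `bits_per_byte` is only illustrative, not part of lm_eval):

```python
import math


def bits_per_byte(byte_perplexity: float) -> float:
    """Convert a byte-level perplexity into bits per byte (log base 2)."""
    return math.log2(byte_perplexity)


# (byte_perplexity, reported bits_per_byte) copied from the AWQ tables above
reported = {
    "MinMax": (1.5466, 0.6291),
    "MSE": (1.5500, 0.6323),
}

for observer, (byte_ppl, bpb) in reported.items():
    computed = bits_per_byte(byte_ppl)
    print(f"{observer}: computed {computed:.4f}, reported {bpb:.4f}")
```

Both observers agree with the reported values to within rounding, so the small MinMax vs. MSE gap shows up consistently across the byte-level metrics.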

MSE Observer (0.2 max shrink)

Quant Scheme Observer QuantizationModifier GPTQModifier
fp8_dynamic_per_token MinMax 0.753–0.754
MSE 0.759-0.760
fp8_static_per_tensor MinMax 0.757–0.758
MSE 0.770-0.770
int8_w8a8_dynamic_per_token MinMax 0.760–0.761 0.769–0.771
MSE 0.764-0.767
vl_fp8_dynamic_per_token MSE 0.833
vl_w4a16_actorder_weight MSE 0.867
w4a16_actorder_group MinMax 0.726-0.726
MSE 0.731-0.731
w4a16_actorder_weights MinMax 0.721-0.722
MSE 0.724-0.726
w4a16_grouped_quant MinMax 0.666–0.671 0.717–0.718
MSE 0.726-0.727
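
For context on the "0.2 max shrink" setting, here is a rough sketch of how an MSE-style range search is commonly written. It assumes "max shrink" means the largest fraction by which the observed min/max range may be shrunk while searching for the range with the lowest quantization error. That reading, and every name and default below (`fake_quantize`, `mse_range`, `grid`, `num_bits`), are assumptions for illustration and not the actual observer implementation.

```python
def fake_quantize(values, lo, hi, num_bits=4):
    """Clamp values to [lo, hi] and round them onto a uniform grid of 2**num_bits levels."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    out = []
    for v in values:
        clamped = min(max(v, lo), hi)
        q = round((clamped - lo) / scale)
        out.append(q * scale + lo)
    return out


def mse_range(values, num_bits=4, max_shrink=0.2, grid=100):
    """Grid-search a shrunken [min, max] range that minimizes quantization MSE.

    Assumes the values span zero, so scaling both ends by p < 1 narrows the range.
    Returns (best_error, best_lo, best_hi).
    """
    lo, hi = min(values), max(values)
    best = (float("inf"), lo, hi)
    # i = 0 is the plain MinMax range; later steps shrink it by up to max_shrink
    for i in range(int(max_shrink * grid)):
        p = 1.0 - i / grid
        cand_lo, cand_hi = p * lo, p * hi
        deq = fake_quantize(values, cand_lo, cand_hi, num_bits)
        err = sum((a - b) ** 2 for a, b in zip(values, deq)) / len(values)
        if err < best[0]:
            best = (err, cand_lo, cand_hi)
    return best


weights = [-1.0, -0.4, -0.1, 0.0, 0.05, 0.2, 0.3, 0.9]
err, lo, hi = mse_range(weights)
print(f"chosen range: [{lo:.3f}, {hi:.3f}], mse: {err:.6f}")
```

Because the unshrunk MinMax range is the first candidate, the search in this sketch can never pick a range with higher reconstruction error than MinMax on the calibration values it sees.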

Time Sheets

meta-llama/Meta-Llama-3-8B-Instruct

MinMax:

| Step | Time (seconds) |
|---|---|
| _load_model_and_processor | 5.772182941436768 |
| _calibrate | 251.95170068740845 |
| _run_oneshot | 252.93479776382446 |
| _save_compressed_model | 41.454792976379395 |
| _handle_recipe | 0.002226591110229492 |
| _run_lm_eval | 1196.4064140319824 |

MSE: