a useful reference for understanding a model's capabilities.
Both types of benchmark matter, but they serve different purposes.
When should you run performance benchmarks?
General load-testing tools -
Locust and k6 simulate real-world traffic and focus on load testing: generating large numbers of concurrent requests to see how your LLM deployment performs under pressure.
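The core idea behind these tools can be sketched in a few lines of plain Python. Here the HTTP call is stubbed out with a sleep (a real load test would call your LLM endpoint via an HTTP client); in practice Locust or k6 handle the scheduling, ramp-up, and reporting for you:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> float:
    """Placeholder for a real HTTP call to an LLM endpoint.
    Returns the observed latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for network + inference time
    return time.perf_counter() - start

def run_load_test(num_requests: int, concurrency: int) -> list[float]:
    """Fire num_requests prompts with at most `concurrency` in flight."""
    prompts = [f"prompt-{i}" for i in range(num_requests)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_request, prompts))

latencies = run_load_test(num_requests=20, concurrency=5)
print(f"completed {len(latencies)} requests, max latency {max(latencies):.3f}s")
```

Varying `concurrency` while watching latency is the essence of a load test: the point where latency starts climbing tells you how much traffic the deployment can absorb.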
Specialized benchmarking tools-
NVIDIA GenAI-Perf and LLMPerf target LLM performance benchmarking specifically, focusing on inference-level metrics such as throughput and latency.
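To make these inference-level metrics concrete, here is a small sketch of how throughput and latency percentiles are typically derived from per-request measurements. The numbers below are hypothetical sample data, not output from any particular tool:

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile of a sample."""
    s = sorted(values)
    idx = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[idx]

# Hypothetical per-request measurements from one benchmark run:
# (end-to-end latency in seconds, tokens generated)
results = [(0.8, 120), (1.1, 150), (0.9, 130), (2.0, 140), (1.0, 125)]

latencies = [r[0] for r in results]
total_tokens = sum(r[1] for r in results)
wall_time = 3.5  # seconds the whole run took (requests overlap)

print(f"p50 latency: {statistics.median(latencies):.2f}s")   # 1.00s
print(f"p95 latency: {percentile(latencies, 95):.2f}s")      # 2.00s
print(f"throughput: {total_tokens / wall_time:.1f} tokens/s") # 190.0
```

Note that throughput is computed against wall-clock time for the whole run, not the sum of individual latencies, because concurrent requests overlap.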
Framework-specific tools -
vLLM and SGLang offer their own benchmarking scripts, commands, and usage guidelines, which are helpful for quick experiments.
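As one concrete illustration, vLLM ships a serving benchmark script in its repository. The model name below is an example, and flag names may vary between vLLM versions, so check the repo's benchmarking docs before running:

```shell
# Start an OpenAI-compatible vLLM server (model name is an example)
vllm serve meta-llama/Llama-3.1-8B-Instruct

# In another terminal, run vLLM's bundled serving benchmark
# (the script lives in the vLLM repo's benchmarks/ directory)
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 \
  --request-rate 4
```

The script reports metrics such as request throughput and latency, which makes it convenient for quick before/after comparisons when you change server settings.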
End-to-end benchmarking with llm-optimizer -
llm-optimizer is an open-source tool for benchmarking and optimizing LLM inference performance. It evaluates how an LLM behaves across different server parameters.