instead of processing each request individually, batching groups requests together so the same model parameters are reused across multiple requests, improving throughput

types of batching:


leads to wasted compute resources and increased latency

ensures early requests are not delayed indefinitely by later ones

still does not achieve maximum GPU efficiency
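The core batching idea above can be sketched as a toy serving loop. This is an illustration only: `fake_model_step`, `serve_in_batches`, and the batch size are made-up names/values, not a real serving API. The point is that one "forward pass" handles a whole batch, so model weights are read once per step instead of once per request.

```python
from collections import deque

def fake_model_step(batch):
    # Stand-in for one forward pass over the whole batch: the model
    # parameters are loaded once regardless of how many requests are
    # in the batch, which is where the throughput win comes from.
    return [f"out:{req}" for req in batch]

def serve_in_batches(requests, batch_size):
    """Group queued requests into batches of up to `batch_size` and
    run each batch in a single model step. A trailing partial batch
    still costs a full step, which is the wasted-compute problem
    noted above."""
    queue = deque(requests)
    outputs = []
    steps = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        outputs.extend(fake_model_step(batch))
        steps += 1
    return outputs, steps

# 10 requests at batch size 4 -> 3 model steps instead of 10.
outs, steps = serve_in_batches(list(range(10)), batch_size=4)
```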


Paged Attention -

the KV cache takes a big chunk of memory and is traditionally stored as one giant contiguous block (which leads to memory fragmentation and wasted space)

to avoid this, paged attention breaks that big chunk into smaller fixed-size blocks,

meaning the KV cache is stored in non-contiguous blocks; a lookup table (block table) keeps track of these blocks, and the LLM loads only the blocks it needs instead of loading everything at once.

this saves memory and makes the whole process more efficient; it even allows the same block to be shared across different outputs (e.g. several continuations of the same prompt) if needed.
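The block-table bookkeeping can be sketched as follows. This is a toy model under assumed names (`PagedKVCache`, `append_token`, `fork`, a block size of 4 tokens) — real systems like vLLM manage actual GPU memory; here we only track which physical block ids each sequence maps to, and use reference counts so shared blocks aren't freed twice.

```python
class PagedKVCache:
    """Toy block-table manager: maps each sequence's logical KV blocks
    to physical block ids, mimicking the lookup table described above."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size              # tokens per block (assumed: 4)
        self.free = list(range(num_blocks))       # pool of physical block ids
        self.refcount = [0] * num_blocks          # shared blocks have refcount > 1
        self.tables = {}                          # sequence id -> [physical block ids]
        self.lengths = {}                         # sequence id -> tokens cached so far

    def append_token(self, seq):
        """Cache one more token's KV for `seq`, allocating a new
        physical block only when the current one is full."""
        table = self.tables.setdefault(seq, [])
        n = self.lengths.get(seq, 0)
        if n % self.block_size == 0:              # current block full (or first token)
            block = self.free.pop()
            self.refcount[block] += 1
            table.append(block)
        self.lengths[seq] = n + 1

    def fork(self, parent, child):
        """Share the parent's blocks with a new sequence: no KV data
        is copied, only the block table and refcounts — this is the
        block-sharing-across-outputs point made above."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for b in self.tables[child]:
            self.refcount[b] += 1

# 5 tokens at block size 4 -> 2 physical blocks, non-contiguous ids.
cache = PagedKVCache(num_blocks=8)
for _ in range(5):
    cache.append_token("seq-a")
cache.fork("seq-a", "seq-b")   # second output reuses the same blocks
```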