instead of processing each request individually, batching groups requests together so the same model parameters are reused across multiple requests, improving throughput

types of batching:


leads to wasted compute resources and increased latency

ensures early requests are not delayed indefinitely by later ones

still does not achieve maximum GPU efficiency
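The core batching idea above can be sketched as a toy serving loop. This is an illustration only: `fake_model_step`, `serve_in_batches`, and the batch size are made-up names/values, not a real serving API. The point is that one "forward pass" handles a whole batch, so model weights are read once per step instead of once per request.

```python
from collections import deque

def fake_model_step(batch):
    # Stand-in for one forward pass over the whole batch: the model
    # parameters are loaded once regardless of how many requests are
    # in the batch, which is where the throughput win comes from.
    return [f"out:{req}" for req in batch]

def serve_in_batches(requests, batch_size):
    """Group queued requests into batches of up to `batch_size` and
    run each batch in a single model step. A trailing partial batch
    still costs a full step, which is the wasted-compute problem
    noted above."""
    queue = deque(requests)
    outputs = []
    steps = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        outputs.extend(fake_model_step(batch))
        steps += 1
    return outputs, steps

# 10 requests at batch size 4 -> 3 model steps instead of 10.
outs, steps = serve_in_batches(list(range(10)), batch_size=4)
```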


Paged Attention -

the KV cache takes a big chunk of memory and is traditionally stored as one giant contiguous block (which leads to memory fragmentation and wasted space)

to avoid this, paged attention breaks that big chunk into smaller fixed-size blocks,

meaning the KV cache is stored in non-contiguous blocks; a lookup table (block table) keeps track of these blocks, and the LLM loads only the blocks it needs instead of loading everything at once.

this saves memory and makes the whole process more efficient; it even allows the same block to be shared across different outputs (e.g. several continuations of the same prompt) if needed.
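The block-table bookkeeping can be sketched as follows. This is a toy model under assumed names (`PagedKVCache`, `append_token`, `fork`, a block size of 4 tokens) — real systems like vLLM manage actual GPU memory; here we only track which physical block ids each sequence maps to, and use reference counts so shared blocks aren't freed twice.

```python
class PagedKVCache:
    """Toy block-table manager: maps each sequence's logical KV blocks
    to physical block ids, mimicking the lookup table described above."""

    def __init__(self, num_blocks, block_size=4):
        self.block_size = block_size              # tokens per block (assumed: 4)
        self.free = list(range(num_blocks))       # pool of physical block ids
        self.refcount = [0] * num_blocks          # shared blocks have refcount > 1
        self.tables = {}                          # sequence id -> [physical block ids]
        self.lengths = {}                         # sequence id -> tokens cached so far

    def append_token(self, seq):
        """Cache one more token's KV for `seq`, allocating a new
        physical block only when the current one is full."""
        table = self.tables.setdefault(seq, [])
        n = self.lengths.get(seq, 0)
        if n % self.block_size == 0:              # current block full (or first token)
            block = self.free.pop()
            self.refcount[block] += 1
            table.append(block)
        self.lengths[seq] = n + 1

    def fork(self, parent, child):
        """Share the parent's blocks with a new sequence: no KV data
        is copied, only the block table and refcounts — this is the
        block-sharing-across-outputs point made above."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for b in self.tables[child]:
            self.refcount[b] += 1

# 5 tokens at block size 4 -> 2 physical blocks, non-contiguous ids.
cache = PagedKVCache(num_blocks=8)
for _ in range(5):
    cache.append_token("seq-a")
cache.fork("seq-a", "seq-b")   # second output reuses the same blocks
```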