sources -
https://www.aleksagordic.com/blog/vllm
https://medium.com/@rubihali/inference-engines-backbone-of-llm-3149623ece55
https://ranjankumar.in/large-language-models-llms-inference-and-serving#mixture-of-experts-moe
https://www.youtube.com/watch?v=9tvJ_GYJA-o
https://hamzaelshafie.bearblog.dev/paged-attention-from-first-principles-a-view-inside-vllm/
https://www.youtube.com/watch?v=3SBUCJzogj4&list=PLuOtX1AplgcvdjCK-tk4TPiYOGsKkil6J&index=5&t=12s
https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
https://www.youtube.com/watch?v=hMs8VNRy5Ys
prompt caching - https://sankalp.bearblog.dev/how-prompt-caching-works/#llm-inference-basics
My conversation with Claude on Prefill vs Decoding - https://claude.ai/share/34be7887-720f-4ab4-8a0e-0b8fc51478e6
actual numbers when a 50K-token request is sent -