sources -
https://www.aleksagordic.com/blog/vllm
https://medium.com/@rubihali/inference-engines-backbone-of-llm-3149623ece55
https://ranjankumar.in/large-language-models-llms-inference-and-serving#mixture-of-experts-moe
https://www.youtube.com/watch?v=9tvJ_GYJA-o
https://hamzaelshafie.bearblog.dev/paged-attention-from-first-principles-a-view-inside-vllm/
https://www.youtube.com/watch?v=3SBUCJzogj4&list=PLuOtX1AplgcvdjCK-tk4TPiYOGsKkil6J&index=5&t=12s
https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
https://www.youtube.com/watch?v=hMs8VNRy5Ys
prompt caching - https://sankalp.bearblog.dev/how-prompt-caching-works/#llm-inference-basics
My conversation with Claude on Prefill vs Decoding - https://claude.ai/share/34be7887-720f-4ab4-8a0e-0b8fc51478e6
actual numbers when a 50K-token request is sent -