Yao Fu | Website | Blog | Twitter / X

University of Edinburgh

[email protected]


<aside> 💭 Yao: I want to write a punch line saying deep communication is through writing. Can you think of some sentences?

GPT-4: True depth is found not in speech, but in the quiet dance of pen on paper.

</aside>

Table of Contents

Apr 2024 | Llama 3 Opens the Second Chapter of the Game of Scale

Yao Fu. University of Edinburgh

The scaling of text data is likely hitting a ceiling, as most of the easily available web text (Common Crawl, GitHub, Arxiv, etc.) has already been used up. New text data may only incrementally improve model performance, because it is unlikely to add another order of magnitude. The first chapter of the game of scale, namely scaling up text data, is coming to a conclusion, with frontier models converging at GPT-4 parity. Video data can be orders of magnitude larger than text data. It significantly improves the perception of language models and opens the possibility of large world models; however, it does not seem to improve reasoning. Reinforcement learning has not yet been scaled, and most existing work focuses only on single-step offline optimization. Scaling up exploration and exploitation with online, iterative RL from human, environment, and AI feedback could potentially further improve the model's reasoning.


Mar 2024 | How Do Language Models put Attention Weights over Long Context?

Yao Fu. University of Edinburgh

We are interested in the problem of lossless KV cache compression: making the KV cache take less memory without sacrificing the language model's capability during inference. We tend to view lossless KV cache compression as the number one challenge for democratizing and deploying long-context (100K - 10M) language models in the real world.
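To get a feel for why this matters, here is a minimal back-of-the-envelope sketch of KV cache memory, assuming a LLaMA-2-7B-like configuration (32 layers, 32 KV heads of dimension 128, fp16) that I picked purely for illustration; it is not a calculation from the post:

```python
# Back-of-the-envelope KV cache size for a LLaMA-2-7B-like config
# (32 layers, 32 KV heads, head_dim 128, fp16). Numbers are illustrative only.

def kv_cache_bytes(seq_len, batch_size=1, num_layers=32,
                   num_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, stored for every layer, head, and position
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

for ctx in (4_096, 100_000, 1_000_000):
    gb = kv_cache_bytes(ctx) / 1e9
    print(f"context {ctx:>9,d} tokens -> ~{gb:6.1f} GB of KV cache per sequence")
```

At 100K context this already exceeds the 80 GB of a single A100 for one sequence, which is why the cache, not the weights, quickly becomes the bottleneck for long-context serving.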

But sorry, we won't discuss any techniques related to KV cache compression in this post 😅. Instead, we look at its prerequisite, i.e., the attention patterns inside the transformer architecture, because only an in-depth understanding of the attention mechanism allows us to find out which parts of the KV cache are compressible and which are not.

In this post, we discuss six typical attention patterns over long-context input, across all transformer layers and heads, aiming to provide an intuitive understanding of what is happening inside long-context attention and to potentially identify which parts of the KV cache are compressible.
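As a taste of this kind of analysis, the sketch below pulls per-layer, per-head attention weights out of a small HuggingFace model and probes one well-known pattern (the "attention sink" on the first token). The model choice and the probe are my own illustrative stand-ins, not the setup or the six patterns from the post:

```python
# Minimal sketch: inspect per-layer, per-head attention weights.
# gpt2 is only a small stand-in for a long-context model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shape [batch, num_heads, seq_len, seq_len]
for layer_idx, attn in enumerate(out.attentions):
    # fraction of the last query's attention mass that lands on token 0 (the "sink")
    sink_mass = attn[0, :, -1, 0]
    print(f"layer {layer_idx:2d}: mass on token 0 per head =",
          [f"{m:.2f}" for m in sink_mass.tolist()])
```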


Dec 2023 | Towards 100x Speedup: Full Stack Transformer Inference Optimization

Yao Fu. University of Edinburgh

Imagine two companies have equally powerful models. Company A can serve the model to 10 users with 1 GPU, but company B can serve 20 users. Who will win in the long run?

Imagine a researcher has come up with a super smart decoding method: clever algorithm, solid math, but not compatible with FlashAttention. Can this method be used in production?

An in-depth understanding of transformer inference can be extremely beneficial for both research and production. Yet in the real world, large-scale production is usually not so close to cutting-edge research: people who know the algorithms may not know MLSys, and vice versa.

In this post, we discuss full-stack transformer inference optimization, from hardware specs like the A100 memory hierarchy, to MLSys methods like FlashAttention and vLLM, to model architectures like Mixture of Experts, to decoding algorithms like Speculative Decoding and its variants. Like adding buffs in an RPG game, we see how transformer inference is scaled and sped up, step by step.
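To give a flavor of the hardware-level reasoning involved, here is a rough roofline-style sketch of why single-batch decoding of a 7B fp16 model is memory-bandwidth-bound on an A100. The ~2 TB/s HBM bandwidth and ~312 TFLOPS fp16 tensor-core peak come from NVIDIA's public spec sheet; everything else is a deliberate simplification, not the post's exact analysis:

```python
# Rough estimate: is batch-1 decoding of a 7B fp16 model memory- or compute-bound
# on an A100-80GB? Spec numbers from NVIDIA's data sheet; the rest is simplified.
PARAMS          = 7e9          # model parameters
BYTES_PER_PARAM = 2            # fp16
HBM_BANDWIDTH   = 2.0e12       # ~2 TB/s (A100 80GB SXM)
FP16_PEAK_FLOPS = 312e12       # ~312 TFLOPS tensor-core peak

# Per decoded token (batch size 1): stream every weight once, ~2 FLOPs per weight
weight_bytes  = PARAMS * BYTES_PER_PARAM
flops_per_tok = 2 * PARAMS

t_memory  = weight_bytes / HBM_BANDWIDTH      # time to read the weights
t_compute = flops_per_tok / FP16_PEAK_FLOPS   # time if compute were the limit

print(f"memory-bound latency : {t_memory*1e3:5.2f} ms/token "
      f"(~{1/t_memory:.0f} tok/s ceiling)")
print(f"compute-bound latency: {t_compute*1e3:5.2f} ms/token")
print(f"=> batch-1 decoding is ~{t_memory/t_compute:.0f}x more limited by "
      f"memory bandwidth than by compute, which is why batching helps")
```

The same arithmetic is what makes the "10 users vs 20 users per GPU" question above a question about memory traffic and batching rather than raw FLOPS.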