Time: UTC+8 2025/10/20 12:30-13:30

How did we meet?

I found him on Zhihu.com, where he shared his experience on the oneDNN Graph Compiler project. At that time he was an engineer at Intel Shanghai. On Zhihu he introduced how to use their compiler, their experience, and its bottlenecks. We found we had run into similar problems serving LLMs on CPU, like "too many barriers" and "thread load imbalance". So I just sent him an invitation and added his WeChat. He likes my blog "Matmul in C/Java, a comparison". (I should post more!)

Main Content

I didn’t record the meeting, but I remember about 95% of its content. The chat was in Chinese; I have translated our conversation into my own words.

Haibin: Hello senpai! Very happy to meet you! Could you introduce what you did on the CPU Graph Compiler?

Senpai: Yeah, very happy to chat with you too. So we built a CPU Graph Compiler for AI models. Basically, the compiler accepts a kind of IR from oneDNN (somehow, for political reasons, we didn't use MLIR), then it generates high-performance kernels for the models. In the generation phase, we introduced several optimizations, for example grouping the barriers and better AMX instruction kernels.

Now let me introduce these optimizations:

  1. Grouping the barriers. Imagine that we have 56 cores in a machine like Sapphire Rapids. We can group them into 7 groups of 8 threads each. Each group computes one tile of the big matrix, and after finishing the computation, its threads synchronize only with the threads in the same group. So originally the time is t = max(thread_time) over all threads, but now some groups can go on to compute more tiles.

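A toy cost model can illustrate why per-group barriers beat one global barrier. This is my own sketch, not the compiler's code: the function names and numbers are made up, and a "round" means each thread in a group computes one tile and then the group synchronizes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// group_round_cost[g]: time for group g to finish one round, dominated by
// that group's slowest thread.

// With one global barrier, every round costs the slowest group's time,
// and rounds are divided evenly among groups.
double global_barrier_makespan(const std::vector<double>& group_round_cost,
                               int rounds_total) {
    double worst = *std::max_element(group_round_cost.begin(),
                                     group_round_cost.end());
    int groups = static_cast<int>(group_round_cost.size());
    int rounds_per_group = (rounds_total + groups - 1) / groups;
    return worst * rounds_per_group;
}

// With per-group barriers, groups run independently and pull the next round
// from a shared pool, so a fast group simply computes more tiles.
double grouped_makespan(const std::vector<double>& group_round_cost,
                        int rounds_total) {
    std::vector<double> busy_until(group_round_cost.size(), 0.0);
    for (int r = 0; r < rounds_total; ++r) {
        // The group that frees up earliest takes the next round.
        auto g = std::min_element(busy_until.begin(), busy_until.end())
                 - busy_until.begin();
        busy_until[g] += group_round_cost[g];
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

With two groups where one is twice as fast ({1.0, 2.0}) and 6 rounds of tiles, the global barrier finishes at t = 6.0, while per-group barriers finish at t = 4.0, with the fast group computing twice as many tiles.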

  2. Prefetching: when a thread inside a group has finished its own job while the others haven't, we can let it prefetch the data for the next round!

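A minimal sketch of the idea, using the GCC/Clang `__builtin_prefetch` builtin. The function name and cache-line size are my own assumptions, not anything from their compiler.

```cpp
#include <cassert>
#include <cstddef>

// Issue software prefetches for the next tile's data. A thread that finished
// its tile early can call this instead of spinning idly at the group barrier.
// Prefetch is only a hint: it never changes data, so it is harmless if the
// lines are already cached.
inline void prefetch_tile(const float* next_tile, std::size_t bytes) {
    constexpr std::size_t kCacheLine = 64;  // typical x86 cache-line size
    const char* p = reinterpret_cast<const char*>(next_tile);
    for (std::size_t off = 0; off < bytes; off += kCacheLine)
        __builtin_prefetch(p + off, /*rw=*/0, /*locality=*/3);
}
```

(`__builtin_prefetch` is GCC/Clang-specific; on MSVC the equivalent would be `_mm_prefetch`.)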

  3. Thread pool management. We implemented our barrier using a spinlock, because a spinlock is more lightweight, and we also use atomic instructions. The main backend, like TBB or OpenMP, is only responsible for spawning the threads.
  4. Better kernels. We found that for matmul, most kernels split the M and N dimensions, while splitting the K dimension sometimes works better. But it is more complicated, since the K axis needs a reduction. So, for your llama.cpp, are its matmul kernels good enough? (Me: no, they are not the best.)
  5. AMX instructions. We spent a lot of effort on AMX, which is like Tensor Cores on CPU. We observed that at some point the whole workload becomes memory bound instead of compute bound. But somehow, AMX can't save CPU & Intel. I will explain this later.
  6. Task reordering: we do this at the compiler level. The compiler analyzes the dependencies of the graph and automatically generates kernels. The point is, since we use fork-join, all tasks are scheduled statically. That is, once compilation completes, each thread knows exactly which jobs it is going to run. Compare this to Taskflow: a Taskflow thread won't know which jobs it will run until it starts and pulls jobs from the central scheduler.
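The spinlock-plus-atomics barrier he described can be sketched as a sense-reversing spin barrier. This is a generic textbook construction, not Intel's actual implementation:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// A sense-reversing spin barrier: threads decrement a counter with an atomic
// read-modify-write; the last one to arrive re-arms the counter and bumps the
// generation, releasing everyone else from their spin loop. No OS call, no
// futex — cheap for the short, frequent waits inside a fused kernel.
class SpinBarrier {
public:
    explicit SpinBarrier(int n) : n_(n), count_(n) {}

    void wait() {
        int gen = generation_.load(std::memory_order_acquire);
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            count_.store(n_, std::memory_order_relaxed);  // re-arm for reuse
            generation_.fetch_add(1, std::memory_order_release);
        } else {
            while (generation_.load(std::memory_order_acquire) == gen) {
                // spin; a real implementation would insert a pause hint here
            }
        }
    }

private:
    const int n_;
    std::atomic<int> count_;
    std::atomic<int> generation_{0};
};
```

The generation counter is what makes the barrier reusable across rounds: a thread only leaves its spin once the generation it observed on entry has changed.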
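On splitting the K dimension: each K-slice produces a partial product that must be reduced into the output, which is the extra complication he mentioned. A minimal sketch (the helper name is mine; a real parallel kernel would give each slice a private accumulator and reduce at the end):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<float>>;  // row-major dense matrix

// Accumulate one K-slice's partial product: C += A[:, k0:k1] * B[k0:k1, :].
// Slices covering [0, K) between them reproduce the full matmul; in a
// parallel version each slice would write a private copy of C, summed later.
void matmul_kslice(const Mat& A, const Mat& B, Mat& C, int k0, int k1) {
    for (std::size_t i = 0; i < A.size(); ++i)
        for (int k = k0; k < k1; ++k)
            for (std::size_t j = 0; j < B[0].size(); ++j)
                C[i][j] += A[i][k] * B[k][j];
}
```

Running two slices [0,1) and [1,2) gives the same result as one slice [0,2) — the reduction over K is just addition, but it forces either synchronization or extra accumulator memory, unlike M/N splits where outputs are disjoint.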
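The static fork-join scheduling can be sketched as a "compile step" that topologically sorts the task DAG and fixes each thread's job list before any thread runs. All names here are illustrative, and real schedulers would weigh task costs rather than round-robin:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// deps[t] lists the tasks that must finish before task t may start.
// Returns plan[thread] = ordered list of task ids that thread will execute.
// Kahn's algorithm gives a topological order; tasks are placed round-robin,
// so the whole schedule is known before execution starts — unlike Taskflow,
// where workers pull tasks from a runtime scheduler.
std::vector<std::vector<int>> schedule_static(
        const std::vector<std::vector<int>>& deps, int n_threads) {
    int n = static_cast<int>(deps.size());
    std::vector<int> indeg(n, 0);
    std::vector<std::vector<int>> succ(n);
    for (int t = 0; t < n; ++t)
        for (int d : deps[t]) { succ[d].push_back(t); ++indeg[t]; }

    std::queue<int> ready;
    for (int t = 0; t < n; ++t)
        if (indeg[t] == 0) ready.push(t);

    std::vector<std::vector<int>> plan(n_threads);
    int next_thread = 0;
    while (!ready.empty()) {
        int t = ready.front(); ready.pop();
        plan[next_thread].push_back(t);
        next_thread = (next_thread + 1) % n_threads;
        for (int s : succ[t])
            if (--indeg[s] == 0) ready.push(s);
    }
    return plan;
}
```

For a diamond-shaped graph (task 3 depends on 1 and 2, which both depend on 0) and two threads, the plan is fixed at "compile time": thread 0 runs {0, 2} and thread 1 runs {1, 3}. Cross-thread dependencies would still need barriers at runtime, but the assignment itself never changes.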