Summary

problem 1: on edge devices, the small memory size limits which NNs can run.
problem 2: memory-related ops such as alloc and de-alloc account for a fairly large share of latency.

Memory copies can be hidden behind computation with asynchronous copies, but memory alloc/de-alloc cannot be hidden the same way because of their synchronization primitives.

design: Occamy (3 steps)
- Liveness-aware Memory Operation Insertion
  - Insert GPU memory operations (malloc, memcpy, dealloc) into DNN IR.
  - Use this data to build a liveness table per tensor.
  - Each required tensor is allocated and deallocated individually.
  - Eager mode: allocate everything up front. (Many tensors then occupy memory at the same time, so the footprint cannot be reduced much.)
  - Lazy mode: allocate and deallocate on demand, right when each tensor is needed. OS system calls and mutex-guarded atomic access then slow down the runtime.
  - In short, eager vs. lazy is a footprint-runtime trade-off.
  - The paper builds a liveness table so that, if A and B are needed at different times, the space allocated for A is also used for B.
    - (Existing) A alloc → A use → A de-alloc → B alloc → B use → B de-alloc
    - (Proposed) A alloc → A use → B use → A de-alloc
- Layer Fusion and Tensor Coalescing
  - Layer Fusion: merge compatible operations (e.g., `Conv + ReLU → Conv-ReLU`).
  - Tensor Coalescing: reuse memory between input and output tensors for elementwise ops (e.g., `Add`).
    - Avoids extra memory for temporary outputs.
  - → Fewer memory allocations, more memory reuse.
- Memory Pool Code Generation: generates instructions that emulate mallocs within the pool (e.g., `DNN.Mem-offset(base, offset, size)`). → Eliminates the need for dynamic allocation or deallocation calls at runtime.

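
The liveness table and the pool offsets can be illustrated with a small sketch. This is not Occamy's code: the `Tensor` record, the greedy first-fit placement, and the use of a Python `bytearray` in place of GPU memory are all assumptions made for illustration. It only shows how live ranges let two tensors share the same bytes, and how a `DNN.Mem-offset`-style instruction can stand in for a malloc.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size: int       # bytes
    first_use: int  # index of the first op that touches the tensor
    last_use: int   # index of the last op that touches the tensor


def eager_footprint(tensors):
    """Eager mode: every tensor owns a private buffer for the whole run."""
    return sum(t.size for t in tensors)


def assign_pool_offsets(tensors):
    """Greedy first-fit placement (an assumption, not the paper's algorithm).

    Each tensor goes to the lowest offset that does not collide with any
    already-placed tensor whose live range overlaps its own.
    """
    placed = []   # (offset, tensor) pairs already laid out in the pool
    offsets = {}
    for t in sorted(tensors, key=lambda t: t.first_use):
        # Byte ranges held by tensors that are live at the same time as t.
        busy = sorted(
            (off, off + p.size)
            for off, p in placed
            if not (p.last_use < t.first_use or t.last_use < p.first_use)
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break               # the gap before this range is big enough
            offset = max(offset, end)
        offsets[t.name] = offset
        placed.append((offset, t))
    pool_size = max(offsets[t.name] + t.size for t in tensors)
    return offsets, pool_size


def mem_offset(base, offset, size):
    """Stand-in for DNN.Mem-offset: a view into the pool, no allocation."""
    return base[offset:offset + size]


# A and B are never live together, so B can reuse A's bytes.
tensors = [
    Tensor("A", 100, 0, 1),
    Tensor("C", 50, 1, 2),
    Tensor("B", 100, 2, 3),
]
offsets, pool_size = assign_pool_offsets(tensors)
pool = memoryview(bytearray(pool_size))       # the single up-front allocation
buf_b = mem_offset(pool, offsets["B"], 100)   # "malloc" for B at runtime
print(offsets, pool_size, eager_footprint(tensors))
```

In this example A and B both land at offset 0, so the pool needs 150 bytes where the eager layout needs 250, and no allocation call happens after the pool is created.
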
implementation
- Built on MLIR by extending ONNX-MLIR (originally CPU-only).
- Added GPU support with CUDA backend.
- Compiles ONNX models into LLVM IR with memory pool logic.

ONNX IR → DNN IR → LLVM IR → executable binary

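The liveness table is maintained at the ONNX and DNN IR levels, layer fusion and tensor coalescing are performed at the DNN IR level, and the memory-pool-related (de)allocations are also unified at the DNN IR level before everything is finally lowered to LLVM IR. As is typical of MLIR, some ONNX IR still coexists at the DNN IR stage. After that, IR ops such as DNN.Conv are replaced with cuDNN kernel calls, and linking enables the GPU.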
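
The fusion and coalescing passes that run at the DNN IR level can be sketched on a toy op list. The `Op` tuple, the op kind names (`"ReLU"`, `"ConvReLU"`), and the `ELEMENTWISE` set are hypothetical and chosen only for illustration. Real passes would be MLIR rewrite patterns, and this sketch says nothing about how Occamy actually implements them.

```python
from collections import Counter
from typing import NamedTuple


class Op(NamedTuple):
    kind: str      # e.g. "Conv", "ReLU", "Add" (hypothetical names)
    inputs: tuple  # names of the input tensors
    output: str    # name of the output tensor


ELEMENTWISE = {"Add", "ReLU"}


def fuse_conv_relu(ops):
    """Fuse Conv -> ReLU pairs when the Conv result feeds only that ReLU."""
    uses = Counter(name for op in ops for name in op.inputs)
    fused, skip = [], set()
    for i, op in enumerate(ops):
        if i in skip:
            continue
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if (op.kind == "Conv" and nxt is not None and nxt.kind == "ReLU"
                and nxt.inputs == (op.output,) and uses[op.output] == 1):
            # The intermediate tensor disappears; the fused op writes the
            # ReLU output directly.
            fused.append(Op("ConvReLU", op.inputs, nxt.output))
            skip.add(i + 1)
        else:
            fused.append(op)
    return fused


def coalesce_elementwise(ops):
    """Map each elementwise output to an input buffer that dies at that op."""
    produced = {op.output for op in ops}  # excludes weights and graph inputs
    last_use = {}
    for i, op in enumerate(ops):
        for name in op.inputs:
            last_use[name] = i
    alias = {}
    for i, op in enumerate(ops):
        if op.kind in ELEMENTWISE:
            for name in op.inputs:
                if name in produced and last_use[name] == i:
                    # Follow earlier aliases so chains share one buffer.
                    alias[op.output] = alias.get(name, name)
                    break
    return alias


ops = [
    Op("Conv", ("x", "w1"), "t1"),
    Op("ReLU", ("t1",), "t2"),
    Op("Conv", ("t2", "w2"), "t3"),
    Op("Add", ("t3", "t2"), "y"),
]
fused = fuse_conv_relu(ops)
alias = coalesce_elementwise(fused)
print([op.kind for op in fused], alias)
```

Here the first Conv and ReLU collapse into one op, which removes the temporary `t1`, and the `Add` output `y` is written into the buffer of `t3` because `t3` is not read again afterwards.
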

implementation tools (my guess)

| Step | Verdict | Notes |
|---|---|---|
| ONNX-MLIR frontend | ✅ | Used ONNX dialect as input |
| ONNX → DNN IR | ✅ | Via MLIR rewrite passes |
| Optimization passes | ✅ | Custom MLIR passes for fusion & pooling |
| DNN IR → LLVM IR | ✅ | Using MLIR pattern rewrites |
| Linking with cuDNN | 🔶 Mostly right | Probably used `clang -lcudnn`, not `llc` directly |
| Executable binary | ✅ | Result is a GPU-inference binary |

implementation details

All optimization work is done at the MLIR level, and the new instructions are inserted there as well. At the end, however, the IR is lowered to LLVM IR so that PTX code can be generated from it.

ONNX IR → DNN IR (MLIR) → LLVM IR → CUDA Runtime API → Executable

One new dialect was added to MLIR.