Author: https://twitter.com/ldl19691031

In this article, I try to analyze the detailed culling process of Nanite. If you are not familiar with how Nanite works, please check these two articles I wrote before:

The whole pipeline of UE5: https://www.notion.so/Brief-Analysis-of-UE5-Rendering-Pipeline-feedcb9174aa4af2af936fbb02a9e390

The brief analysis of Nanite: https://www.notion.so/Brief-Analysis-of-Nanite-94be60f292434ba3ae62fa4bcf7d9379

Warning: I'm not an Epic Games developer. This article may contain mistakes, although I will try to fix them as soon as possible. If I mislead you, please forgive me.

Also, this article contains many implementation details. If you are not interested in the implementation details, feel free to skip this one.

Before We Start

As we discussed in 'The brief analysis of Nanite', Nanite builds a tree containing two kinds of nodes: HierarchyNodes, each of which contains a set of children, and cluster nodes, which contain groups of clusters. Both carry their own bounds for culling. This article focuses on HOW the culling is executed rather than WHY Nanite does it this way. If you want to know more about the reasons, please read 'The brief analysis of Nanite'.
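
To make the tree structure concrete, here is a minimal sketch of what such nodes might look like. The field names and layout are my own illustration, not Nanite's actual data format:

```cuda
// Illustrative sketch only: field names and layout are my own, not Nanite's.
struct BoundingSphere
{
    float Center[3];
    float Radius;
};

// Interior node: references a range of child nodes in a flat node array.
struct HierarchyNode
{
    BoundingSphere Bounds;      // used to cull the whole subtree at once
    unsigned int   FirstChild;  // index of the first child in the node array
    unsigned int   NumChildren;
    unsigned int   bLeaf;       // nonzero when the children are cluster groups
};

// Leaf payload: references a contiguous range of clusters.
struct ClusterGroup
{
    BoundingSphere Bounds;
    unsigned int   FirstCluster;
    unsigned int   NumClusters;
};
```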

The culling system has two levels: instance-level culling and persistent culling. Since many other articles have already covered instance culling, both in Unreal Engine and in other engines, this article focuses on the second one: how to cull the small parts of a high-poly Nanite mesh.

Nanite makes heavy use of compute shader group shared memory, so I will explain the GPU memory model a little. If you already know it, just skip the following part.

GPU Execution Model

A compute shader executes on many threads in parallel, organized into thread groups.

A set of threads is grouped and scheduled to run in parallel, executing the same instruction at each step; this is the 'Single Instruction, Multiple Threads' (SIMT) model.

So there are three kinds of memory that each thread can access: global memory, which is visible to all threads in all groups; group shared memory, which is shared by the threads in the same group but separate between groups; and local memory, which is private to each individual thread.
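
As a rough illustration in CUDA terms (the HLSL equivalents would be RWStructuredBuffer for global memory and 'groupshared' variables for group shared memory), here is a small kernel that touches all three kinds of memory:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: sums 256 input values per thread group.
// Assumes it is launched with 256 threads per group (block).
__global__ void SumPerGroup(const float* Input, float* GroupSums, int Count)
{
    // Group shared memory: one copy per thread group, visible only to the
    // threads inside that group.
    __shared__ float SharedData[256];

    // Local memory / registers: private to this single thread.
    int GlobalIndex = blockIdx.x * blockDim.x + threadIdx.x;
    float MyValue = (GlobalIndex < Count) ? Input[GlobalIndex] : 0.0f; // read from global memory

    SharedData[threadIdx.x] = MyValue;
    __syncthreads(); // make the shared-memory writes visible to the whole group

    // Reduce within the group using shared memory.
    for (int Stride = blockDim.x / 2; Stride > 0; Stride /= 2)
    {
        if (threadIdx.x < Stride)
            SharedData[threadIdx.x] += SharedData[threadIdx.x + Stride];
        __syncthreads();
    }

    // Global memory: visible to all threads in all groups, and to later passes.
    if (threadIdx.x == 0)
        GroupSums[blockIdx.x] = SharedData[0];
}
```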

There are many better explanations of compute shaders and the GPU execution model, so if you want to read more, please check these:

The life of a triangle: https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline

Compute Shader: https://docs.microsoft.com/en-us/windows/win32/direct3d11/direct3d-11-advanced-stages-compute-shader

The Problem

Nanite needs to solve a problem: how do you cull a tree structure in parallel, as fast as possible?

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f9c0d584-e53b-47e8-b83f-a1b59da7d55b/Untitled.png

If we map threads to all the tree nodes at once, they certainly run in parallel, but many threads are wasted: if a parent node has been culled, there is no need to check its children.

On the other hand, if we dynamically map threads to the nodes of the current level, then we need to keep dispatching new threads, because each level has a different number of nodes. The sketch below makes this trade-off concrete.
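
A straightforward level-by-level version would look roughly like this (my own illustration, reusing the node structs sketched earlier, not Nanite's code): one dispatch per tree level, with a CPU round-trip or an indirect dispatch between levels to size the next launch.

```cuda
// Naive level-by-level culling sketch: each dispatch handles exactly one tree
// level and appends the surviving children as the input of the next dispatch.

// Placeholder visibility test; a real test would do frustum/occlusion culling.
__device__ bool IsVisible(const BoundingSphere& /*Bounds*/) { return true; }

__global__ void CullOneLevel(const HierarchyNode* Nodes,
                             const unsigned int* InputNodes, int NumInput,
                             unsigned int* OutputNodes, unsigned int* OutputCount)
{
    int Index = blockIdx.x * blockDim.x + threadIdx.x;
    if (Index >= NumInput)
        return;

    const HierarchyNode& Node = Nodes[InputNodes[Index]];
    if (!IsVisible(Node.Bounds))
        return; // culled: the whole subtree is skipped

    // Append this node's children as input for the next level's dispatch.
    unsigned int WriteOffset = atomicAdd(OutputCount, Node.NumChildren);
    for (unsigned int i = 0; i < Node.NumChildren; ++i)
        OutputNodes[WriteOffset + i] = Node.FirstChild + i;
}
```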

So, Nanite builds a Multi-Producer Multi-Consumer (MPMC) model as an answer to this problem.

Instead of spawning new threads, new jobs are added to a queue to be consumed by workers. Initially, the queue is populated with work generated by preceding shader passes. While the persistent shader is running, its workers continuously consume items from the queue and also produce new items.
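
A minimal sketch of that persistent-worker pattern, under my own simplifying assumptions (a ring buffer indexed by two global atomic counters, and a naive termination check; the real shader has to be far more careful about queue capacity and about deciding when all producers are actually finished):

```cuda
// Simplified persistent MPMC worker loop (my own sketch, not Nanite's shader).
// A fixed number of thread groups is launched once; every thread keeps popping
// node indices from a global queue and pushes the children of visible nodes
// back onto the same queue. Leaf/cluster handling is omitted.
struct WorkQueue
{
    unsigned int* Items;       // ring buffer of node indices
    unsigned int* ReadOffset;  // incremented atomically by consumers
    unsigned int* WriteOffset; // incremented atomically by producers
    unsigned int  Capacity;
};

__global__ void PersistentCull(const HierarchyNode* Nodes, WorkQueue Queue)
{
    while (true)
    {
        // Consume: claim the next item in the queue.
        unsigned int Slot = atomicAdd(Queue.ReadOffset, 1u);

        // Naive termination: stop once we have claimed past the produced items.
        // (The real shader must also wait for producers that are still running.)
        if (Slot >= atomicAdd(Queue.WriteOffset, 0u))
            break;

        const HierarchyNode& Node = Nodes[Queue.Items[Slot % Queue.Capacity]];
        if (!IsVisible(Node.Bounds))
            continue; // culled: nothing is produced, the subtree disappears

        // Produce: push the children so any worker in any group can pick them up.
        unsigned int WriteSlot = atomicAdd(Queue.WriteOffset, Node.NumChildren);
        for (unsigned int i = 0; i < Node.NumChildren; ++i)
            Queue.Items[(WriteSlot + i) % Queue.Capacity] = Node.FirstChild + i;
    }
}
```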