This is a follow-up to my previous post on the stack monoid, but it is intended to be self-contained.
GPUs are well known for being efficient on array-structured data, where it is possible to operate on the elements of the array in parallel. That requirement doesn’t mean that the operations have to be completely independent; it’s also well known that GPUs are good at algorithms like prefix sum, where a simplistic approach would imply a sequential data dependency, but sophisticated algorithms can exploit parallelism inherent in the problem.
In piet-gpu, I have a particular desire to work with data that is fundamentally tree-structured: the scene description. In particular, there are clip nodes, each described by a clip path, whose children are masked by that clip path before being composited. The level of nesting is not limited in advance. In SVG, this is represented by the clipPath element, and the tree structure is explicit in the XML structure. All other modern 2D graphics APIs and formats have a similar capability. For Web canvas, the corresponding method is clip, but in this case the tree structure is represented by the nesting of save and restore calls.
One of the design principles of piet-gpu is that the scene is presented as a sequence of elements. The tree structure is represented by matching begin_clip and end_clip elements. The problem of determining the bounding box of each element can be expressed roughly as this pseudocode:
stack = [viewport.bbox]
result = []
for element in scene:
    match element:
        begin_clip(path) => stack.push(intersect(path.bbox, stack.last()))
        end_clip => stack.pop()
        _ => pass
    result.push(intersect(element.bbox, stack.last()))
Basically, this captures the fact that each drawing element is clipped by all the clip paths on the path from it up to the root of the tree. Having precise bounding boxes is essential for the performance and correctness of later stages in the pipeline. There’s a bit more discussion in clipping in piet-gpu, an issue in this blog’s repo that will hopefully become a post soon.
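To make the pseudocode concrete, here is a sequential Rust sketch of the same loop. The `Element`, `Kind`, `Bbox`, and `intersect` names are placeholders of my own, not the actual piet-gpu types, and the scene is assumed to contain balanced begin_clip/end_clip pairs.

```rust
#[derive(Clone, Copy, Debug)]
struct Bbox {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Axis-aligned rectangle intersection (the result may be empty).
fn intersect(a: Bbox, b: Bbox) -> Bbox {
    Bbox {
        x0: a.x0.max(b.x0),
        y0: a.y0.max(b.y0),
        x1: a.x1.min(b.x1),
        y1: a.y1.min(b.y1),
    }
}

#[derive(Clone, Copy)]
enum Kind {
    BeginClip, // for this variant, the element's bbox is the clip path's bbox
    EndClip,
    Draw,
}

struct Element {
    kind: Kind,
    bbox: Bbox,
}

/// Sequential reference version of the pseudocode above.
fn clip_bboxes(scene: &[Element], viewport: Bbox) -> Vec<Bbox> {
    let mut stack = vec![viewport];
    let mut result = Vec::with_capacity(scene.len());
    for element in scene {
        match element.kind {
            Kind::BeginClip => {
                let top = *stack.last().unwrap();
                stack.push(intersect(element.bbox, top));
            }
            Kind::EndClip => {
                // Balanced input assumed, so the viewport entry is never popped.
                stack.pop();
            }
            Kind::Draw => (),
        }
        result.push(intersect(element.bbox, *stack.last().unwrap()));
    }
    result
}
```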
The computation of bounding box intersections is not particularly tricky. If it were just a sequence of pushes and no pops, it could easily be modelled by a scan (generalized prefix sum) operation over the “rectangle intersection” monoid; a sketch of that push-only case follows below. The tricky part is the stack, which is basically unbounded in size. In my previous post, I had some ideas about how to handle that, as the stack is a monoid, but the proposal wasn’t all that practical. For one, it required additional scratch space that was difficult to bound in advance.
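To illustrate the push-only case: if clips only ever nested deeper (no end_clip), the clip in effect at element i would simply be the running intersection of all clip bboxes seen so far, i.e. an inclusive scan over the rectangle-intersection monoid. A sketch, reusing the `Bbox` and `intersect` placeholders from above:

```rust
/// Identity of the rectangle-intersection monoid: intersecting with it
/// changes nothing.
const ALL: Bbox = Bbox {
    x0: f32::NEG_INFINITY,
    y0: f32::NEG_INFINITY,
    x1: f32::INFINITY,
    y1: f32::INFINITY,
};

/// Inclusive scan, written sequentially. Because `intersect` is associative
/// and `ALL` is its identity, a GPU could compute the same result with any
/// standard parallel prefix-scan algorithm.
fn running_clip(clip_bboxes: &[Bbox]) -> Vec<Bbox> {
    let mut acc = ALL;
    clip_bboxes
        .iter()
        .map(|&b| {
            acc = intersect(acc, b);
            acc
        })
        .collect()
}
```

The associativity and identity are exactly what let a parallel prefix-scan replace the sequential loop; the pops are what break this simple picture.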
Since then, I’ve figured out a better algorithm. In fact, I think it’s a really good algorithm, and might even be considered to push the boundary of what’s possible to implement reasonably efficiently on GPU.
For the remainder of this post, I’ll drop the bounding boxes and just consider the stack structure; those can be added back in later without any serious difficulty. This is known in the literature as the “parentheses matching problem,” among other names. I prefer “stack monoid” because identifying it as a monoid is a clear guide that parallel (or incremental) algorithms are possible, but I’m not sure the term has caught on. Another reason I prefer the “monoid” terminology is that it emphasizes that the pure stack monoid can be composed with other monoids such as axis-aligned rectangle intersection; conceptually, I think of the data type implemented by this algorithm as Stack<T> where T: Monoid.
The simplified problem can be stated as:
stack = []
result = []
for i in 0..len(input):
    result.push(stack.last())
    match input[i]:
        '(' => stack.push(i)
        ')' => stack.pop()
Here, every open parenthesis is assigned the index of its parent in the tree, and every close parenthesis is assigned the index of its corresponding open parenthesis.
An example:
index   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
input   (   (   (   )   (   (   (   )   )   (   (   )   (   )   )   )   )   )
value   -   0   1   2   1   4   5   6   5   4   9  10   9  12   9   4   1   0
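As a sanity check, here is a throwaway sequential Rust version of the loop above, run against this example (my own code, not piet-gpu’s); note that it records the top of the stack before processing each element:

```rust
fn parent_links(input: &str) -> Vec<Option<usize>> {
    let mut stack: Vec<usize> = Vec::new();
    let mut result = Vec::with_capacity(input.len());
    for (i, c) in input.chars().enumerate() {
        // Record the top of the stack *before* applying this element: an open
        // paren sees its parent, a close paren sees its matching open.
        result.push(stack.last().copied());
        match c {
            '(' => stack.push(i),
            ')' => {
                stack.pop();
            }
            _ => (),
        }
    }
    result
}

fn main() {
    let links = parent_links("((()((())(()()))))");
    let expected = [
        None, Some(0), Some(1), Some(2), Some(1), Some(4), Some(5), Some(6),
        Some(5), Some(4), Some(9), Some(10), Some(9), Some(12), Some(9),
        Some(4), Some(1), Some(0),
    ];
    assert_eq!(links, expected);
}
```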
It might be easier to visualize in tree form:
An interesting fact is that from this it’s quite straightforward to reconstruct the snapshot of the stack at each element, just by walking up the chain of parent links to the root (sketched in code below). And, importantly, this “all snapshots” representation takes no more space than the stack itself, since a single stack can already grow to O(n) when nesting depth is not otherwise bounded. This sequence of snapshots can be considered an immutable data structure which is gradually resolved and revealed by the algorithm. The stack itself, by contrast, is both mutable and requires O(n) space for each instance, a serious problem for efficient GPU implementation.
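Here is a small sketch of that reconstruction, using the `parent_links` output from the example above: the snapshot of the stack just before element i is recovered by starting at `links[i]` and repeatedly following parent links.

```rust
/// Reconstruct the snapshot of the stack just before element `i`, top of
/// stack first, by walking the chain of parent links toward the root.
fn stack_at(links: &[Option<usize>], i: usize) -> Vec<usize> {
    let mut snapshot = Vec::new();
    let mut cur = links[i];
    while let Some(j) = cur {
        snapshot.push(j);
        cur = links[j];
    }
    snapshot
}
```

For the example input, `stack_at(&links, 13)` returns `[12, 9, 4, 1, 0]`, the stack just before element 13, from top to bottom.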
In case this “monoid” language is too confusing or abstract, there’s a more concrete way to understand this, which will also be useful in explaining the full GPU algorithm. Consider the snapshot of the stack at each iteration of the above simple loop. Then it turns out to be fairly straightforward to express a command to go from the stack at step i to the stack at step j: pop some number of elements, then push a sequence of some new ones.
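Concretely, such a command can be represented as a pop count plus a vector of pushed values, and two commands compose into a single command of the same shape, which is precisely what makes this a monoid. The sketch below uses my own names rather than piet-gpu’s, and is generic over the payload T (indices in the parenthesis problem, bounding boxes in the full one):

```rust
/// "Pop `pops` elements, then push the values in `pushes` (bottom to top)."
#[derive(Clone, Debug)]
struct StackCommand<T> {
    pops: usize,
    pushes: Vec<T>,
}

impl<T: Clone> StackCommand<T> {
    /// The identity command: pop nothing, push nothing.
    fn identity() -> Self {
        StackCommand { pops: 0, pushes: Vec::new() }
    }

    /// The net effect of performing `self`, then `other`.
    fn combine(&self, other: &StackCommand<T>) -> StackCommand<T> {
        if other.pops <= self.pushes.len() {
            // `other` only pops values that `self` pushed.
            let keep = self.pushes.len() - other.pops;
            let mut pushes = self.pushes[..keep].to_vec();
            pushes.extend_from_slice(&other.pushes);
            StackCommand { pops: self.pops, pushes }
        } else {
            // `other` pops through everything `self` pushed and keeps going.
            StackCommand {
                pops: self.pops + (other.pops - self.pushes.len()),
                pushes: other.pushes.clone(),
            }
        }
    }

    /// Apply the command to a concrete stack (assumes `pops <= stack.len()`).
    fn apply(&self, stack: &mut Vec<T>) {
        stack.truncate(stack.len() - self.pops);
        stack.extend_from_slice(&self.pushes);
    }
}
```

An open parenthesis at index i corresponds to `StackCommand { pops: 0, pushes: vec![i] }` and a close parenthesis to `StackCommand { pops: 1, pushes: vec![] }`; combining the commands for a contiguous chunk of the input summarizes that chunk’s net effect on the stack, which is the kind of per-chunk summary a parallel algorithm can work with.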