The Stack Monoid
This is a bit of a follow-up to Towards GPGPU JSON parsing. That post proposed a rather roundabout way to parallelize a simple parsing task. With more GPU programming experience under my belt, I no longer expect that particular approach to work well, but it did suggest that parallelism exists in the problem.
This post is a writeup of a new idea, but with a caution: there is no implementation yet. It probably contains some mistakes, and maybe the idea is flawed. But if it holds up, I think it’s an exciting line of research on how to port sequential algorithms to the GPU.
For this post, I’m going to pose an even more simplified version of the problem: for each open bracket, record the index of the parent in the parse tree, and for each close bracket, the index of the corresponding open bracket.
This is a simple sequential program:
stack = [None]
result = [None] * len(input)
for i, token in enumerate(input):
    result[i] = stack[-1]
    if token == '[':
        stack.append(i)
    elif token == ']':
        stack.pop()
To follow the running example from the JSON post, assume the input [[][[][][[]]][][]]. The result is:
index   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
input   [  [  ]  [  [  ]  [  ]  [  [  ]  ]  ]  [  ]  [  ]  ]
value   -  0  1  0  3  4  3  6  3  8  9  8  3  0 13  0 15  0
Can it be parallelized? It’s challenging to see how, as there are dependencies on previous state. Further, the state itself has unbounded size. It can be O(n); a pathological case is a million open brackets followed by a million close brackets. In practice, depending on the workload, we might expect more modest nesting depth, but ideally we’d like an algorithm that handles all cases at least reasonably well.
Maybe we can do something. To longtime readers of my blog (and the “rope science” series before that), it will be no surprise I’ll try to use monoids.
So let’s turn this sequential program into a monoid. If I had a time portal, I’d lens forward in time and use the handy automated tool that some enterprising PhD student will have built by then. But, failing that, I’ll do it in my head.
The monoid is basically a sequence of pops followed by a sequence of pushes. Since the pop operations (in this case) don’t have a payload, they can be represented simply as a count. So the monoid is the pair (n_pops, elements). The primitive for push is (0, [element]), and for pop it’s (1, []). The monoid operation is:
def combine(m1, m2):
    n1, s1 = m1
    n2, s2 = m2
    if n2 >= len(s1):
        return (n1 + n2 - len(s1), s2)
    else:
        return (n1, s1[:len(s1) - n2] + s2)
Running a scan, or generalized prefix sum, over this monoid leaves the desired value at the top of the stack, i.e. the last element of the sequence in the monoid value.
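To make this concrete, here is a small sketch (my addition, not from the post): map each token to its monoid primitive and fold them left to right with combine, seeding the running state with the initial stack [None]. Reading the top of the running stack just before absorbing token i reproduces the table above. A GPU implementation would compute the same partial states with a parallel scan rather than this sequential fold.

```python
def combine(m1, m2):
    n1, s1 = m1
    n2, s2 = m2
    if n2 >= len(s1):
        return (n1 + n2 - len(s1), s2)
    else:
        return (n1, s1[:len(s1) - n2] + s2)

def primitive(i, token):
    # a push carries its index as the payload; a pop is just a count
    return (0, [i]) if token == '[' else (1, [])

tokens = "[[][[][][[]]][][]]"
state = (0, [None])  # seed with the initial stack
result = []
for i, token in enumerate(tokens):
    result.append(state[1][-1])  # top of stack = parent / matching open
    state = combine(state, primitive(i, token))

print(result)
# [None, 0, 1, 0, 3, 4, 3, 6, 3, 8, 9, 8, 3, 0, 13, 0, 15, 0]
```

Note that the value at position i comes from the exclusive prefix (everything before token i), which is why the top of the stack is read before combining in that token’s primitive.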
Exercise: Show that this monoid is associative.
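The exercise can also be checked empirically. This quick property test (my addition) compares the two bracketings of combine on random monoid values:

```python
import random

def combine(m1, m2):
    n1, s1 = m1
    n2, s2 = m2
    if n2 >= len(s1):
        return (n1 + n2 - len(s1), s2)
    else:
        return (n1, s1[:len(s1) - n2] + s2)

random.seed(0)

def random_monoid():
    # small random (n_pops, elements) values
    return (random.randrange(4),
            [random.randrange(10) for _ in range(random.randrange(4))])

for _ in range(10_000):
    a, b, c = random_monoid(), random_monoid(), random_monoid()
    assert combine(combine(a, b), c) == combine(a, combine(b, c))
```

A passing test is evidence, not a proof; the pen-and-paper argument still needs a case analysis on whether the pops in the right operand consume all of the left operand’s pushes.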
I do believe that an automated tool for translating these kinds of simple sequential programs into monoids is possible, and that there’s a theory behind it that illuminates the ways in which monoids do and do not compose. Some composition is possible; for example, you could fairly easily extend this idea so that each child knows its sequence number relative to its siblings. To do that, blend in the “counting monoid,” which is just integer addition.
Exercise: Extend the monoid to handle sequence numbers.
If we had a bound on stack depth, we’d be pretty well set. The problem is the unbounded case. My original thinking was to have a window of, say, k elements, where each pass handles a slice of the stack of size k. I think this can be made to work, but at heart it requires a number of passes proportional to the stack depth, so in the worst case it’s O(n^2).