(ASymptotically Universal Recursive Architecture)
ASURA is based on some prior work; here are some citations, ranked by importance:
In essence: ASURA is an iterative improvement over prior work, in that it only uses self-attention blocks in a DEQ/Universal Transformer style: n self-attention blocks applied recursively i times, with a long skip connection borrowed from Bansal et al. that alleviates most of the gradient problems. For this project, i and n will both be constant. I’ve poured (too much) effort into dynamically changing i during training, but it’s hard to optimize.
A core focus of this architecture is simplicity and scalability. Following the bitter lesson: architectures that are scalable often trump those relying on too many priors and inductive biases. Thus, I want to lean towards advertising the architecture’s generality and applicability to various domains.
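For concreteness, here is a minimal PyTorch sketch of the recursive core, assuming a standard pre-norm self-attention block and a concatenate-then-project form for the long skip connection; the block internals, the `inject` fusion, and all dimensions are illustrative assumptions rather than the exact ASURA configuration:

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """A generic pre-norm self-attention + MLP block (internals are assumptions)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class RecursiveCore(nn.Module):
    """n weight-shared self-attention blocks applied recursively i times, with a
    long skip connection back to the original input at every iteration
    (Bansal-et-al.-style input injection; the concat-then-project fusion is an assumption)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_blocks: int = 4, n_iters: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([SelfAttentionBlock(d_model, n_heads) for _ in range(n_blocks)])
        self.inject = nn.Linear(2 * d_model, d_model)  # fuses the current state with the original input
        self.n_iters = n_iters

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        h = x0
        for _ in range(self.n_iters):                    # i recursive applications
            h = self.inject(torch.cat([h, x0], dim=-1))  # long skip back to the input
            for block in self.blocks:                    # the same n blocks every iteration
                h = block(h)
        return h

# usage: embeddings in, refined representations out
# core = RecursiveCore()
# y = core(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
```

Because the same blocks are reused at every iteration, i can in principle be increased at test time without adding parameters, which is what the extrapolation angle discussed below leans on.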
Another view of UTs is that it’s simply an aggressive weight-sharing scheme across the depth dimension instead of the time dimension, as in traditional RNNs. There’s quite a lot of OOD extrapolation work that mentions how weight sharing effectively acts almost like regularisation. The inductive bias often helps a lot, if not completely, in many situations.
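To make the weight-sharing view concrete, a toy parameter-count comparison (the block choice and sizes here are arbitrary, just to show where the parameters go):

```python
import torch.nn as nn

d_model, n_heads, depth = 512, 8, 24

def make_block():
    return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

# Standard Transformer: 'depth' distinct blocks, so parameters grow linearly with depth.
stacked = nn.ModuleList([make_block() for _ in range(depth)])

# UT-style: one block (or a small group of n blocks) reused 'depth' times,
# so effective depth can change, even at test time, while parameters stay fixed.
shared = make_block()

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(stacked) / n_params(shared))  # roughly equal to 'depth'
```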
I did a bit of rough exploration (Twitter thread) a few months ago, in which I trained a similar UT but in a completely sequence-to-sequence style instead of the more traditional next-token-prediction style.
Figure 14 in this lovely paper, and another paper by Robert Csordas (really nice guy; he does quite a bit of UT- and extrapolation-adjacent research, and we had a nice conversation), are some small hidden gems that show how well UTs work for generalization but are often ignored in a lot of contemporary research.
[From Csordas et al.]
There are a lot of angles one could take. I could focus on OOD extrapolation, which is much cooler but broader in scope, or get an initial paper out that pretrains and evaluates a model on reasoning benchmarks and build upon it later in follow-up work.
I’m leaning towards the latter, but that’s mostly an arbitrary choice, and basically a bet on what I think would work well for conferences and for attracting labs.
(In follow-up work, I would love to do a bit of mechanistic interpretability and go deeper into why things work and what the model discovers, and maybe even test some synthetic tasks to see whether the priors from a pretrained model aid in OOD extrapolation and other capabilities, like whether it can do multi-hop reasoning.)
There is also a neat theoretical insight that I feel could make for an interesting aside in the paper.