(ASymptotically Universal Recursive Architecture)
ASURA is based on some prior work; here are some citations, ranked by importance:
In essence: ASURA is an iterative improvement over prior work, in that it only uses self-attention blocks in a DEQ/Universal Transformer style: n self-attention blocks applied recursively i times, with a long skip connection borrowed from Bansal et al. that alleviates most of the gradient problems. For this project, i and n will both be constant. I’ve poured (too much) effort into dynamically changing i during training, but it’s hard to optimize.
A core focus of this architecture is simplicity and scalability. Following the bitter lesson: architectures that are scalable often trump those relying on too many priors and inductive biases. Thus, I want to lean towards advertising the architecture’s generality and applicability to various domains.
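For concreteness, here is a minimal PyTorch sketch of the recursive core, assuming a standard pre-norm self-attention block and a concatenate-then-project form for the long skip connection; the block internals, the `inject` fusion, and all dimensions are illustrative assumptions rather than the exact ASURA configuration:

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """A generic pre-norm self-attention + MLP block (internals are assumptions)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class RecursiveCore(nn.Module):
    """n weight-shared self-attention blocks applied recursively i times, with a
    long skip connection back to the original input at every iteration
    (Bansal-et-al.-style input injection; the concat-then-project fusion is an assumption)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_blocks: int = 4, n_iters: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([SelfAttentionBlock(d_model, n_heads) for _ in range(n_blocks)])
        self.inject = nn.Linear(2 * d_model, d_model)  # fuses the current state with the original input
        self.n_iters = n_iters

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        h = x0
        for _ in range(self.n_iters):                    # i recursive applications
            h = self.inject(torch.cat([h, x0], dim=-1))  # long skip back to the input
            for block in self.blocks:                    # the same n blocks every iteration
                h = block(h)
        return h

# usage: embeddings in, refined representations out
# core = RecursiveCore()
# y = core(torch.randn(2, 16, 512))  # (batch, seq_len, d_model)
```

Because the same blocks are reused at every iteration, i can in principle be increased at test time without adding parameters, which is what the extrapolation angle discussed below leans on.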
Another view of UTs is that it’s simply an aggressive weight-sharing scheme across the depth dimension instead of the time dimension, as in traditional RNNs. There’s quite a lot of OOD extrapolation work that mentions how weight sharing effectively acts almost like regularisation. The inductive bias often helps a lot, if not completely, in many situations.
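To make the weight-sharing view concrete, a toy parameter-count comparison (the block choice and sizes here are arbitrary, just to show where the parameters go):

```python
import torch.nn as nn

d_model, n_heads, depth = 512, 8, 24

def make_block():
    return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

# Standard Transformer: 'depth' distinct blocks, so parameters grow linearly with depth.
stacked = nn.ModuleList([make_block() for _ in range(depth)])

# UT-style: one block (or a small group of n blocks) reused 'depth' times,
# so effective depth can change, even at test time, while parameters stay fixed.
shared = make_block()

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(stacked) / n_params(shared))  # roughly equal to 'depth'
```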
I did a bit of rough exploration (Twitter thread) a few months ago, in which I trained a similar UT but in a completely sequence-to-sequence style instead of the more traditional next-token-prediction style.
Figure 14 in this lovely paper, and another paper by Robert Csordas (really nice guy; he does quite a bit of UT- and extrapolation-adjacent research, and we had a nice conversation), are some small hidden gems that show how well UTs work for generalization but are often ignored in a lot of contemporary research.
[From Csordas et al.]
There are a lot of angles one could take. I could focus on OOD extrapolation, which is much cooler but broader in scope, or get an initial paper out that pretrains and evaluates a model on reasoning benchmarks and build upon it later in follow-up work.
I’m leaning towards the latter, but that’s mostly an arbitrary choice, and basically a bet on what I think would work well for conferences and for attracting labs.
(In follow-up work, I would love to do a bit of mechanistic interpretability and go deeper into why things work and what the model discovers, and maybe even test some synthetic tasks to see whether the priors from a pretrained model aid in OOD extrapolation and other capabilities, like whether it can do multi-hop reasoning.)
There is also a neat theoretical insight that I feel could make for an interesting aside in the paper.