(ASymptotically Universal Recursive Architecture)


Outline

In essence, ASURA is an iterative improvement over prior work: it uses only self-attention blocks, applied recurrently in a DEQ/Universal Transformer style.
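
For concreteness, here is a minimal PyTorch-style sketch of that idea: a single pre-norm self-attention block whose weights are reused at every depth step. All names and hyperparameters (`SharedSelfAttentionBlock`, `n_steps`, etc.) are hypothetical placeholders for illustration, not the actual implementation.

```python
import torch
import torch.nn as nn


class SharedSelfAttentionBlock(nn.Module):
    """One pre-norm self-attention block; the same weights are reused at every depth step."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = self.norm(h)
        out, _ = self.attn(z, z, z, need_weights=False)
        return h + out  # residual update


class RecurrentEncoder(nn.Module):
    """Universal-Transformer-style encoder: one shared block unrolled for a fixed number of steps."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_steps: int = 8):
        super().__init__()
        self.block = SharedSelfAttentionBlock(d_model, n_heads)
        self.n_steps = n_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x
        for _ in range(self.n_steps):  # depth recurrence: same parameters at every iteration
            h = self.block(h)
        return h
```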

A core focus of this architecture is simplicity and scalability. Following the bitter lesson, scalable architectures often trump those that rely on too many priors and inductive biases. I therefore want to lean towards advertising the architecture’s generality and its applicability across domains.

Another view of UTs is that they are simply an aggressive weight-sharing scheme across the depth dimension rather than the time dimension, as in traditional RNNs. Quite a lot of the OOD extrapolation literature notes that weight sharing acts almost like a regulariser, and this inductive bias often helps substantially, sometimes decisively. The DEQ reading of the same loop is sketched below.
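
To make the DEQ reading of that depth recurrence concrete, here is a sketch that iterates the same shared block until the hidden state stops changing, rather than for a fixed number of steps. This is plain fixed-point iteration for the forward pass only; a real DEQ would use a proper root solver and implicit differentiation for the backward pass. The function name, tolerance, and step budget are assumptions for illustration.

```python
import torch


def fixed_point_forward(block, x: torch.Tensor, max_steps: int = 32, tol: float = 1e-4) -> torch.Tensor:
    """Iterate a shared block (e.g. the sketch above) toward a fixed point of h = block(h + x)."""
    h = torch.zeros_like(x)
    for _ in range(max_steps):
        h_next = block(h + x)  # re-inject the input at every step
        # stop once the relative change in the hidden state falls below the tolerance
        if (h_next - h).norm() / (h.norm() + 1e-8) < tol:
            return h_next
        h = h_next
    return h
```

The contrast with the fixed-step unroll is that effective depth is now set adaptively by convergence rather than chosen up front.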

There are a lot of angles one could take. I could focus on OOD extrapolation, which is much cooler but broader in scope, or get an initial paper out that pretrains and evaluates a model on reasoning benchmarks, then build on it in follow-up work.

I’m leaning towards the latter, but that’s mostly an arbitrary choice and essentially a bet on what I think would work well for conferences and for attracting labs.

(In follow-up work, I would love to do a bit of mechanistic interpretability and go deeper into why things work and what the model discovers - maybe even test some synthetic tasks to see whether the priors from a pretrained model aid OOD extrapolation and other capabilities, such as multi-hop reasoning.)

There is also a neat theoretical insight that I feel could make for an interesting aside in the paper.