Deep Equilibrium Models (DEQs) resemble a recurrent layer (one that receives its own output as input), but applied a variable number of times. How many times? Infinite. Infinite!? Yes, sort of: the authors show that deep stacked transformers behave like contractive functions, meaning that they converge to a stable output, one that the layer maps back to itself.
$$ z^* = f_\theta(z^*; x) $$
In the equation above, z* is the stable output, also called the equilibrium, and x is constant (the input, for example, a sentence).
As you can see, if we apply the function to its own output, the output no longer changes, so we can think of it as having applied the function an infinite number of times: the result stays z*.
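This convergence can be seen numerically. Below is a minimal sketch where the layer is a stand-in `tanh` map with small weights (my assumption to keep it contractive, not the actual transformer layer from the paper); iterating it reaches a point that the map leaves unchanged.

```python
import numpy as np

# Hypothetical contractive layer: f(z, x) = tanh(W @ z + x).
# The 0.1 scale keeps the spectral norm of W below 1, ensuring contraction.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 4))
x = rng.standard_normal(4)

def f(z, x):
    return np.tanh(W @ z + x)

# Fixed-point iteration: keep feeding the output back in until it settles.
z = np.zeros(4)
for _ in range(100):
    z_next = f(z, x)
    if np.linalg.norm(z_next - z) < 1e-8:
        break
    z = z_next

# At the equilibrium, applying f once more leaves z essentially unchanged.
print(np.allclose(f(z, x), z, atol=1e-6))
```

Because the map is contractive, the distance to the fixed point shrinks geometrically, so the loop stops long before the iteration cap.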
The authors propose that instead of applying the function a variable number of times until we reach an equilibrium, we solve for it directly using a root-finding method, which can be much faster.
Root-finding means finding the value at which a function outputs 0. So we can define a new function g and find z* as:
$$ g_\theta(z; x) = f_\theta(z; x) - z $$

$$ z^* = \operatorname{RootFind}(g_\theta; x) $$
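Here is a sketch of that idea, reusing the same toy `tanh` layer as above (an assumption, not the paper's architecture) and handing g to a generic solver instead of iterating:

```python
import numpy as np
from scipy.optimize import root

# Same hypothetical layer: f(z, x) = tanh(W @ z + x), with small weights.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 4))
x = rng.standard_normal(4)

def f(z, x):
    return np.tanh(W @ z + x)

def g(z):
    # g(z) = f(z, x) - z: its root is exactly the equilibrium z*.
    return f(z, x) - z

# A quasi-Newton root finder typically needs far fewer function
# evaluations than plain fixed-point iteration.
sol = root(g, x0=np.zeros(4))
z_star = sol.x

print(sol.success, np.allclose(f(z_star, x), z_star, atol=1e-6))
```

The DEQ paper uses more specialized solvers (e.g. Broyden's method), but any root finder that handles g recovers the same equilibrium.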
That covers most of what is needed to contextualize DEQs, so now on to the next paper. Incremental RNNs ("RNNs Incrementally Evolving on an Equilibrium Manifold: A Panacea for Vanishing and Exploding Gradients?") are a new formulation of the recurrent neural network equations, based on the idea that if we repeatedly apply the recurrent unit within a single timestep, we eventually reach an equilibrium in the hidden state.
If x_t is the input at timestep t, h_{t-1} the previous hidden state, and h_t the current hidden state, they aim to find an h_t such that:
$$ h_t = RNN(x_t, h_t, h_{t-1}) $$
You can see that h_t appears on both sides, so it is an equilibrium point. Among the benefits of the proposed formulation, they prove that this model DOES NOT SUFFER FROM VANISHING GRADIENTS.
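To make the per-timestep equilibrium concrete, here is a toy sketch in the same spirit (the cell form and weights are my assumptions, not the paper's exact parameterization): within one timestep, the cell is applied repeatedly until the candidate h stops changing.

```python
import numpy as np

# Toy recurrent cell whose update depends on the current candidate h
# as well as the previous state h_prev. Small weights keep it contractive.
rng = np.random.default_rng(1)
Wx = 0.1 * rng.standard_normal((3, 3))
Wh = 0.1 * rng.standard_normal((3, 3))
Wp = 0.1 * rng.standard_normal((3, 3))

def cell(x_t, h, h_prev):
    return np.tanh(Wx @ x_t + Wh @ h + Wp @ h_prev)

def step(x_t, h_prev, n_iters=50):
    # One timestep: iterate the cell until the hidden state reaches
    # its equilibrium h_t = cell(x_t, h_t, h_prev).
    h = h_prev.copy()
    for _ in range(n_iters):
        h_new = cell(x_t, h, h_prev)
        if np.linalg.norm(h_new - h) < 1e-8:
            break
        h = h_new
    return h

x_t = rng.standard_normal(3)
h_prev = np.zeros(3)
h_t = step(x_t, h_prev)

# h_t satisfies the equilibrium condition: the cell maps it to itself.
print(np.allclose(cell(x_t, h_t, h_prev), h_t, atol=1e-6))
```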
And experiments show its impressive capacity to deal with long-range dependencies. (The theoretical part of the paper is very complicated; I had to study a lot of linear algebra to follow it, and still not completely, and it could have been much easier on the readers. But the experimental part is super cool!)
To find the equilibrium they just apply the recurrence a predefined number of times (say, 5), so I think the next logical step would be to apply the root-finding of DEQs to these RNNs, extrapolate the iRNN formulation to other models like ResNets or Transformers, and see whether the constant gradient and (almost) guaranteed equilibrium help.
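The first half of that combination is easy to sketch: replace the fixed number of inner iterations with a root finder per timestep. Everything below (cell form, weights, solver choice) is my assumption, just illustrating the idea:

```python
import numpy as np
from scipy.optimize import root

# Same toy cell as before; weights are assumed, not from either paper.
rng = np.random.default_rng(1)
Wx = 0.1 * rng.standard_normal((3, 3))
Wh = 0.1 * rng.standard_normal((3, 3))
Wp = 0.1 * rng.standard_normal((3, 3))

def cell(x_t, h, h_prev):
    return np.tanh(Wx @ x_t + Wh @ h + Wp @ h_prev)

def step(x_t, h_prev):
    # Solve g(h) = cell(x_t, h, h_prev) - h = 0 directly, DEQ-style,
    # instead of iterating the cell a fixed number of times.
    sol = root(lambda h: cell(x_t, h, h_prev) - h, x0=h_prev)
    return sol.x

# Run a short input sequence, solving for the equilibrium at each step.
h = np.zeros(3)
for x_t in rng.standard_normal((5, 3)):
    h = step(x_t, h)

print(h.shape)
```

Warm-starting the solver from h_prev should usually make each solve cheap, since consecutive equilibria tend to be close.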