https://arxiv.org/pdf/1909.11942v2.pdf

side note: I found this paper while looking for examples of when models are too big. I was trying to work out with Jason why residual blocks make intuitive sense; the examples were that an overly complex model can lose information that a shallower model would not have lost.

ALBERT is an extension of BERT and tries to answer the question: are larger models the answer to NLP tasks? ALBERT achieves SOTA results largely through cross-layer parameter sharing, reusing the same layer weights at every depth. By sharing parameters, ALBERT can be much smaller with similar, if not better, performance. The best results from ALBERT come from its widest configurations, which still have fewer parameters than BERT-large, and when the two models are trained for roughly the same amount of wall-clock time, ALBERT performs better than BERT.
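To make the idea concrete, here is a minimal sketch of cross-layer parameter sharing (my own illustration, not the authors' code), using PyTorch's `TransformerEncoderLayer` as a stand-in for a BERT/ALBERT block. The shared encoder applies one set of layer weights at every depth, while the unshared (BERT-style) encoder allocates a fresh set per layer, so the shared version has roughly 1/num_layers of the encoder parameters at the same depth.

```python
import torch.nn as nn


class SharedLayerEncoder(nn.Module):
    """ALBERT-style encoder: one set of layer weights reused at every depth."""

    def __init__(self, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        # A single block; its parameters are shared across all `num_layers` passes.
        self.block = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):  # reuse the same weights each time
            x = self.block(x)
        return x


class UnsharedEncoder(nn.Module):
    """BERT-style encoder: a distinct set of weights per layer."""

    def __init__(self, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model, nhead, dim_feedforward=4 * d_model, batch_first=True
            )
            for _ in range(num_layers)
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


def count_params(m):
    return sum(p.numel() for p in m.parameters())


if __name__ == "__main__":
    # The shared encoder has ~1/12 the encoder parameters at the same depth.
    print(f"shared:   {count_params(SharedLayerEncoder()):,}")
    print(f"unshared: {count_params(UnsharedEncoder()):,}")
```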

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/badff841-4584-4f33-9a45-4638802b1463/Untitled.png

They also found that wider ALBERT models (more hidden units and attention heads) did not need to be deeper to perform better.
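As a rough aside on why ALBERT scales width rather than depth: once layers are shared, adding depth reuses the same weights and adds no new parameters, so hidden size is the main knob that grows the model. A back-of-the-envelope sketch (my own simplification, ignoring embeddings, biases, and LayerNorm):

```python
def encoder_params(d_model: int, num_layers: int, shared: bool) -> int:
    # Approximate parameters per transformer block: ~4*d^2 for the attention
    # projections (Q, K, V, output) plus ~8*d^2 for the feed-forward net.
    per_layer = 12 * d_model ** 2
    # With cross-layer sharing, depth does not multiply the parameter count.
    return per_layer if shared else per_layer * num_layers


for d, layers in [(768, 12), (768, 24), (1024, 12), (2048, 12)]:
    print(f"d={d:4d}, L={layers:2d}: "
          f"shared≈{encoder_params(d, layers, True):>13,}  "
          f"unshared≈{encoder_params(d, layers, False):>13,}")
```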

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/8f08a8c0-cc20-43fe-8f94-6506614c9ff7/Untitled.png

These results are promising, showing that more complex, larger, deeper models do not necessarily perform better. The paper suggests that size does matter, and that there may be an ideal model size for modeling language.

what is the high-level one-sentence summary?

more detailed than the abstract and less detailed than the full paper