https://nlp.stanford.edu//~johnhew//structural-probe.html#the-structural-probe
https://pair-code.github.io/interpretability/bert-tree/
key ideas:
- In a sufficiently high-dimensional space, a random embedding of a tree, where each child is offset from its parent by a random unit Gaussian vector, will be approximately Pythagorean: the squared Euclidean distance between two node embeddings approximates the path distance between those nodes in the tree.
- They use attention probes to classify the relation between two tokens. The intuition is that if a simple linear model can achieve reliable accuracy, then the model-wide attention vector must encode that relationship.
- The model-wide attention vector for a token pair is created by concatenating the corresponding entry of the attention matrix from every attention head in every layer. Look at figure 1 for a picture. So if you had 12 layers and 12 heads, you would have a 144-dimensional attention vector.
- BERT's context embeddings form an approximately Pythagorean embedding of the sentence's dependency parse tree (under the structural probe's learned linear transform, squared distances between word vectors track parse-tree distances).
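The first idea can be checked numerically. A minimal sketch (my own toy tree and helper functions, not from either post): embed each child as its parent's vector plus a random unit Gaussian offset, then compare squared Euclidean distances to tree path distances. In high dimension the random offsets are nearly orthogonal, so the cross terms nearly vanish and the squared distance is close to the path length.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_tree(parent, dim):
    """Embed a tree: each child = parent's embedding + a random unit Gaussian offset."""
    n = len(parent)
    emb = np.zeros((n, dim))
    for node in range(1, n):  # node 0 is the root, placed at the origin
        offset = rng.standard_normal(dim)
        emb[node] = emb[parent[node]] + offset / np.linalg.norm(offset)
    return emb

def tree_distance(parent, i, j):
    """Path length between nodes i and j in the tree."""
    anc_i, d = {}, 0
    while True:
        anc_i[i] = d
        if i == 0:
            break
        i, d = parent[i], d + 1
    d = 0
    while j not in anc_i:
        j, d = parent[j], d + 1
    return d + anc_i[j]

# A small toy tree: parent[i] gives the parent of node i (node 0 is the root).
parent = [0, 0, 0, 1, 1, 2, 3]
dim = 3000  # high-dimensional, so the random unit offsets are nearly orthogonal
emb = embed_tree(parent, dim)

for i in range(len(parent)):
    for j in range(i + 1, len(parent)):
        sq = float(np.sum((emb[i] - emb[j]) ** 2))
        print(i, j, tree_distance(parent, i, j), round(sq, 2))
```

Each printed squared distance lands close to the integer tree distance beside it, which is exactly the "approximately Pythagorean" property.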
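The model-wide attention vector construction is just a slice-and-flatten. A sketch with random arrays standing in for real attention weights (shapes chosen for BERT-base: 12 layers, 12 heads):

```python
import numpy as np

layers, heads, seq = 12, 12, 8
# Stand-in for a transformer's attention weights: one [seq, seq] matrix
# per head per layer (here random, not from a real model).
attn = np.random.rand(layers, heads, seq, seq)

def attention_vector(attn, i, j):
    """Concatenate the (i, j) attention entry from every head in every layer."""
    return attn[:, :, i, j].reshape(-1)

v = attention_vector(attn, 2, 5)
print(v.shape)  # (144,) -- 12 layers x 12 heads
```

This 144-dimensional vector is what the linear probe classifies to predict the relation between tokens i and j.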