https://nlp.stanford.edu//~johnhew//structural-probe.html#the-structural-probe
https://pair-code.github.io/interpretability/bert-tree/
key ideas:
- In a sufficiently high-dimensional space, a random embedding of a tree, where each child is offset from its parent by a random unit Gaussian vector, will be approximately Pythagorean: the squared Euclidean distance between two node embeddings approximates the path distance between those nodes in the tree.
- They use attention probes to classify the relation between two tokens. The intuition is that if a simple linear model can achieve reliable accuracy, then the model-wide attention vector must encode that relationship.
- The model-wide attention vector for a token pair is created by concatenating the corresponding entry of the attention matrix from every attention head in every layer. Look at figure 1 for a picture. So if you had 12 layers and 12 heads, you would have a 144-dimensional attention vector.
- BERT's context embeddings form an approximately Pythagorean embedding of the sentence's dependency parse tree (under the structural probe's learned linear transform, squared distances between word vectors track parse-tree distances).
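The first idea can be checked numerically. A minimal sketch (my own toy tree and helper functions, not from either post): embed each child as its parent's vector plus a random unit Gaussian offset, then compare squared Euclidean distances to tree path distances. In high dimension the random offsets are nearly orthogonal, so the cross terms nearly vanish and the squared distance is close to the path length.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_tree(parent, dim):
    """Embed a tree: each child = parent's embedding + a random unit Gaussian offset."""
    n = len(parent)
    emb = np.zeros((n, dim))
    for node in range(1, n):  # node 0 is the root, placed at the origin
        offset = rng.standard_normal(dim)
        emb[node] = emb[parent[node]] + offset / np.linalg.norm(offset)
    return emb

def tree_distance(parent, i, j):
    """Path length between nodes i and j in the tree."""
    anc_i, d = {}, 0
    while True:
        anc_i[i] = d
        if i == 0:
            break
        i, d = parent[i], d + 1
    d = 0
    while j not in anc_i:
        j, d = parent[j], d + 1
    return d + anc_i[j]

# A small toy tree: parent[i] gives the parent of node i (node 0 is the root).
parent = [0, 0, 0, 1, 1, 2, 3]
dim = 3000  # high-dimensional, so the random unit offsets are nearly orthogonal
emb = embed_tree(parent, dim)

for i in range(len(parent)):
    for j in range(i + 1, len(parent)):
        sq = float(np.sum((emb[i] - emb[j]) ** 2))
        print(i, j, tree_distance(parent, i, j), round(sq, 2))
```

Each printed squared distance lands close to the integer tree distance beside it, which is exactly the "approximately Pythagorean" property.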
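The model-wide attention vector construction is just a slice-and-flatten. A sketch with random arrays standing in for real attention weights (shapes chosen for BERT-base: 12 layers, 12 heads):

```python
import numpy as np

layers, heads, seq = 12, 12, 8
# Stand-in for a transformer's attention weights: one [seq, seq] matrix
# per head per layer (here random, not from a real model).
attn = np.random.rand(layers, heads, seq, seq)

def attention_vector(attn, i, j):
    """Concatenate the (i, j) attention entry from every head in every layer."""
    return attn[:, :, i, j].reshape(-1)

v = attention_vector(attn, 2, 5)
print(v.shape)  # (144,) -- 12 layers x 12 heads
```

This 144-dimensional vector is what the linear probe classifies to predict the relation between tokens i and j.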