1. Top-Level Architectural Flow

This diagram shows the end-to-end path from input messages to final MBTI prediction, with all dimensions and internal logic labeled.

```mermaid
%%{init: {'flowchart': {'useMaxWidth': true}} }%%
graph TD
    subgraph "Input & Embedding"
        A["Input IDs<br/>[8, 256]"] --> B["Word Embeddings<br/>[8, 256, 768]"]
        C["Positional IDs<br/>[256]"] --> D["Pos Embeddings<br/>[1, 256, 768]"]
        B & D --> E["X' (Sum)<br/>[8, 256, 768]"]
    end

    subgraph "Multi-Head Attention (1 of 12 Layers)"
        E --> F["Wq, Wk, Wv Weights<br/>[768, 768]"]
        E & F --> G["Q, K, V Projections<br/>[8, 256, 768]"]
        G --> H["Split into 8 Heads<br/>[8, 8, 256, 64]"]
        
        subgraph "Inside Each Head"
            H --> I["Q_head [256, 64] ×<br/>K_head_Transpose [64, 256]"]
            I --> J["Scores Matrix<br/>[256, 256]"]
            J --> K["Scale (/8) &<br/>Softmax Layers"]
            K --> L["Alphas (Weights)<br/>[256, 256]"]
            L --> M["Alphas × V_head<br/>[256, 64]"]
            M --> N["Head Output<br/>[256, 64]"]
        end
        
        N --> O["Concat All 8 Heads<br/>[8, 256, 768]"]
        O --> P["Final Linear Projection<br/>[8, 256, 768]"]
    end

    subgraph "Residual & FFN"
        P --> Q_res["Add & LayerNorm<br/>[8, 256, 768]"]
        Q_res --> R["Feed Forward Network<br/>(768 -> 3072-> 768)"]
        R --> S["Add & LayerNorm<br/>[8, 256, 768]"]
    end

    subgraph "Output Prediction Head"
        S --> T["Mean Pooling<br/>[8, 768]"]
        T --> U["Classifier (Linear Output)<br/>[8, 16]"]
        U --> V_out["Softmax -> Argmax<br/>-> MBTI Type"]
    end
```
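
To make the shape bookkeeping concrete, here is a minimal PyTorch sketch of the attention path traced in the diagram. It is an illustration, not the project's actual code: the module and variable names are made up, and the dimensions assume batch 8, sequence 256, model width 768, and 12 heads of 64.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads          # 768 / 12 = 64
        self.wq = nn.Linear(d_model, d_model)     # Wq: [768, 768]
        self.wk = nn.Linear(d_model, d_model)     # Wk: [768, 768]
        self.wv = nn.Linear(d_model, d_model)     # Wv: [768, 768]
        self.proj = nn.Linear(d_model, d_model)   # final linear projection

    def forward(self, x):                         # x: [8, 256, 768]
        b, t, d = x.shape
        # Project, then split the 768 width into 12 heads of 64 each.
        q = self.wq(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scores: Q x K^T per head, scaled by sqrt(d_head) = 8.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)  # [8, 12, 256, 256]
        alphas = F.softmax(scores, dim=-1)        # attention weights ("Alphas")
        heads = alphas @ v                        # [8, 12, 256, 64]
        # Concatenate heads back into [8, 256, 768], then project.
        out = heads.transpose(1, 2).contiguous().view(b, t, d)
        return self.proj(out)                     # [8, 256, 768]
```

The softmax runs over the last axis, so each query position's 256 attention weights sum to 1.
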
Why does the Feed Forward Network expand from 768 to 3072 and back?

  1. 768 -> 3072 (The Workspace): Imagine you are trying to solve a complex puzzle. You move from a small desk (768) to a much bigger table (3072) so you can spread all the pieces out. In this 3072-dimension space, the model applies a non-linear function (GELU). This is where the actual "reasoning" happens; it is much harder to do this kind of complex math in a cramped 768-dim space.
  2. 3072 -> 768 (The Result): Once the "reasoning" is done, we have to move the result back to the small desk (768) so we can Add it to the original input (the "Residual Connection"). If we didn't go back to 768, the shapes wouldn't match for the addition! (See the sketch after this list.)
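
A minimal sketch of that expand-then-contract block, assuming the GELU activation and the 3072 hidden width labeled in the diagram (class and attribute names here are illustrative):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: widen to the big table (3072) for the
    non-linear work, then shrink back to the small desk (768) so
    the residual addition lines up."""
    def __init__(self, d_model=768, d_hidden=3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # 768 -> 3072: spread the pieces out
        self.act = nn.GELU()                      # the non-linear "reasoning" step
        self.down = nn.Linear(d_hidden, d_model)  # 3072 -> 768: back for the residual

    def forward(self, x):                         # x: [8, 256, 768]
        return self.down(self.act(self.up(x)))    # same shape as x, ready to Add
```
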

1. AdamW Optimizer (Adam with Decoupled Weight Decay)

Traditional optimizers like SGD (Stochastic Gradient Descent) are like a ball rolling down a hill: it might get stuck in a small pothole (a local minimum). AdamW is like a specialized vehicle designed for rough terrain.
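
A typical PyTorch configuration looks like the sketch below; the hyperparameter values are illustrative fine-tuning defaults, not this project's actual settings:

```python
import torch

# AdamW keeps per-parameter moment estimates (the "suspension" that
# carries it over potholes) and applies weight decay directly to the
# weights rather than folding it into the gradient as plain Adam does.
model = torch.nn.Linear(768, 16)   # stand-in for the full classifier
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,             # small learning rate, typical for fine-tuning
    betas=(0.9, 0.999),  # decay rates for the 1st/2nd moment estimates
    weight_decay=0.01,   # decoupled L2 penalty on the weights
)

loss = model(torch.randn(8, 768)).sum()  # dummy forward pass
loss.backward()
optimizer.step()         # adaptive, per-parameter update
optimizer.zero_grad()
```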


2. Focal Loss (The "Hard Example" Specialist)

Standard Cross-Entropy Loss treats all errors the same. If the model is 90% sure about an ENFP, but 10% sure about a rare ESTJ, it might ignore the ESTJ to focus on making the ENFP 91% sure. This is bad for MBTI because some types are very rare.

Focal Loss fixes this with a special mathematical "multiplier": $(1 - p_t)^\gamma$.
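
For reference, the full focal loss from Lin et al. (2017) wraps that multiplier around the ordinary log-loss:

$$\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)$$

With $\gamma = 2$, a confident sample ($p_t = 0.9$) has its loss scaled by $(1 - 0.9)^2 = 0.01$, while a hard sample ($p_t = 0.1$) keeps $(1 - 0.1)^2 = 0.81$ of its loss.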

How it works:

  1. Down-weighting "Easy" Samples: If the model is already confident about a sample (p_t is high), the term (1 - p_t) becomes very small. This "kills" the loss for that sample. The model says: "I already know this ENFJ, let's stop wasting time on it."
  2. Focusing on "Hard" Samples: If the model is confused (p_t is low), (1 - p_t) remains large. The loss stays high, forcing the model to work harder to learn that specific sample.
  3. The Parameters: $\gamma$ (the focusing strength; $\gamma = 2$ is the common default from the paper) and $\alpha$ (an optional per-class weight that gives rare types extra influence). A minimal implementation is sketched below.
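
A minimal sketch of this loss for a 16-way MBTI classifier (the function and argument names are hypothetical, not this project's actual code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss.

    logits:  [batch, 16] raw classifier outputs
    targets: [batch] true MBTI class indices
    gamma:   focusing strength; gamma = 0 recovers plain cross-entropy
    alpha:   optional [16] tensor of per-class weights for rare types
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()                                          # p_t
    loss = -((1 - pt) ** gamma) * log_pt  # easy samples (high p_t) shrink
    if alpha is not None:
        loss = alpha[targets] * loss      # extra weight for rare classes
    return loss.mean()

# A confident ENFJ (p_t ~ 0.9) contributes almost nothing, while a
# confused rare ESTJ (p_t ~ 0.1) keeps nearly its full loss.
logits = torch.randn(8, 16)
targets = torch.randint(0, 16, (8,))
print(focal_loss(logits, targets))
```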