1. Top-Level Architectural Flow

This diagram shows the end-to-end path from input messages to final MBTI prediction, with all dimensions and internal logic labeled.

```mermaid
%%{init: {'flowchart': {'useMaxWidth': true}} }%%
graph TD
    subgraph "Input & Embedding"
        A["Input IDs<br/>[8, 256]"] --> B["Word Embeddings<br/>[8, 256, 768]"]
        C["Positional IDs<br/>[256]"] --> D["Pos Embeddings<br/>[1, 256, 768]"]
        B & D --> E["X' (Sum)<br/>[8, 256, 768]"]
    end

    subgraph "Multi-Head Attention (1 of 12 Layers)"
        E --> F["Wq, Wk, Wv Weights<br/>[768, 768]"]
        E & F --> G["Q, K, V Projections<br/>[8, 256, 768]"]
        G --> H["Split into 8 Heads<br/>[8, 8, 256, 64]"]
        
        subgraph "Inside Each Head"
            H --> I["Q_head [256, 64] ×<br/>K_head_Transpose [64, 256]"]
            I --> J["Scores Matrix<br/>[256, 256]"]
            J --> K["Scale (/8) &<br/>Softmax Layers"]
            K --> L["Alphas (Weights)<br/>[256, 256]"]
            L --> M["Alphas × V_head<br/>[256, 64]"]
            M --> N["Head Output<br/>[256, 64]"]
        end
        
        N --> O["Concat All 8 Heads<br/>[8, 256, 768]"]
        O --> P["Final Linear Projection<br/>[8, 256, 768]"]
    end

    subgraph "Residual & FFN"
        P --> Q_res["Add & LayerNorm<br/>[8, 256, 768]"]
        Q_res --> R["Feed Forward Network<br/>(768 -> 3072-> 768)"]
        R --> S["Add & LayerNorm<br/>[8, 256, 768]"]
    end

    subgraph "Output Prediction Head"
        S --> T["Mean Pooling<br/>[8, 768]"]
        T --> U["Classifier (Linear Output)<br/>[8, 16]"]
        U --> V_out["Softmax -> Argmax<br/>-> MBTI Type"]
    end
```
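
To make the shape bookkeeping concrete, here is a minimal PyTorch sketch of the attention path traced in the diagram. It is an illustration, not the project's actual code: the module and variable names are made up, and the dimensions assume batch 8, sequence 256, model width 768, and 12 heads of 64.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads          # 768 / 12 = 64
        self.wq = nn.Linear(d_model, d_model)     # Wq: [768, 768]
        self.wk = nn.Linear(d_model, d_model)     # Wk: [768, 768]
        self.wv = nn.Linear(d_model, d_model)     # Wv: [768, 768]
        self.proj = nn.Linear(d_model, d_model)   # final linear projection

    def forward(self, x):                         # x: [8, 256, 768]
        b, t, d = x.shape
        # Project, then split the 768 width into 12 heads of 64 each.
        q = self.wq(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scores: Q x K^T per head, scaled by sqrt(d_head) = 8.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)  # [8, 12, 256, 256]
        alphas = F.softmax(scores, dim=-1)        # attention weights ("Alphas")
        heads = alphas @ v                        # [8, 12, 256, 64]
        # Concatenate heads back into [8, 256, 768], then project.
        out = heads.transpose(1, 2).contiguous().view(b, t, d)
        return self.proj(out)                     # [8, 256, 768]
```

The softmax runs over the last axis, so each query position's 256 attention weights sum to 1.
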
Why does the Feed Forward Network expand from 768 to 3072 and back?

  1. 768 -> 3072 (The Workspace): Imagine you are trying to solve a complex puzzle. You move from a small desk (768) to a much bigger table (3072) so you can spread all the pieces out. In this 3072-dimension space, the model applies a non-linear function (GELU). This is where the actual "reasoning" happens; it is much harder to do this kind of complex math in a cramped 768-dim space.
  2. 3072 -> 768 (The Result): Once the "reasoning" is done, we have to move the result back to the small desk (768) so we can Add it to the original input (the "Residual Connection"). If we didn't go back to 768, the shapes wouldn't match for the addition! (See the sketch after this list.)
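
A minimal sketch of that expand-then-contract block, assuming the GELU activation and the 3072 hidden width labeled in the diagram (class and attribute names here are illustrative):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: widen to the big table (3072) for the
    non-linear work, then shrink back to the small desk (768) so
    the residual addition lines up."""
    def __init__(self, d_model=768, d_hidden=3072):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # 768 -> 3072: spread the pieces out
        self.act = nn.GELU()                      # the non-linear "reasoning" step
        self.down = nn.Linear(d_hidden, d_model)  # 3072 -> 768: back for the residual

    def forward(self, x):                         # x: [8, 256, 768]
        return self.down(self.act(self.up(x)))    # same shape as x, ready to Add
```
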

1. AdamW Optimizer (Adam with Decoupled Weight Decay)

Traditional optimizers like SGD (Stochastic Gradient Descent) are like a ball rolling down a hill: it might get stuck in a small pothole (a local minimum). AdamW is like a specialized vehicle designed for rough terrain.
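
A typical PyTorch configuration looks like the sketch below; the hyperparameter values are illustrative fine-tuning defaults, not this project's actual settings:

```python
import torch

# AdamW keeps per-parameter moment estimates (the "suspension" that
# carries it over potholes) and applies weight decay directly to the
# weights rather than folding it into the gradient as plain Adam does.
model = torch.nn.Linear(768, 16)   # stand-in for the full classifier
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,             # small learning rate, typical for fine-tuning
    betas=(0.9, 0.999),  # decay rates for the 1st/2nd moment estimates
    weight_decay=0.01,   # decoupled L2 penalty on the weights
)

loss = model(torch.randn(8, 768)).sum()  # dummy forward pass
loss.backward()
optimizer.step()         # adaptive, per-parameter update
optimizer.zero_grad()
```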


2. Focal Loss (The "Hard Example" Specialist)

Standard Cross-Entropy Loss treats all errors the same. If the model is 90% sure about an ENFP, but 10% sure about a rare ESTJ, it might ignore the ESTJ to focus on making the ENFP 91% sure. This is bad for MBTI because some types are very rare.

Focal Loss fixes this with a special mathematical "multiplier": $(1 - p_t)^\gamma$.
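
For reference, the full focal loss from Lin et al. (2017) wraps that multiplier around the ordinary log-loss:

$$\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)$$

With $\gamma = 2$, a confident sample ($p_t = 0.9$) has its loss scaled by $(1 - 0.9)^2 = 0.01$, while a hard sample ($p_t = 0.1$) keeps $(1 - 0.1)^2 = 0.81$ of its loss.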

How it works:

  1. Down-weighting "Easy" Samples: If the model is already confident about a sample (p_t is high), the term (1 - p_t) becomes very small. This "kills" the loss for that sample. The model says: "I already know this ENFJ, let's stop wasting time on it."
  2. Focusing on "Hard" Samples: If the model is confused (p_t is low), (1 - p_t) remains large. The loss stays high, forcing the model to work harder to learn that specific sample.
  3. The Parameters: $\gamma$ (the focusing strength; $\gamma = 2$ is the common default from the paper) and $\alpha$ (an optional per-class weight that gives rare types extra influence). A minimal implementation is sketched below.
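
A minimal sketch of this loss for a 16-way MBTI classifier (the function and argument names are hypothetical, not this project's actual code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss.

    logits:  [batch, 16] raw classifier outputs
    targets: [batch] true MBTI class indices
    gamma:   focusing strength; gamma = 0 recovers plain cross-entropy
    alpha:   optional [16] tensor of per-class weights for rare types
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()                                          # p_t
    loss = -((1 - pt) ** gamma) * log_pt  # easy samples (high p_t) shrink
    if alpha is not None:
        loss = alpha[targets] * loss      # extra weight for rare classes
    return loss.mean()

# A confident ENFJ (p_t ~ 0.9) contributes almost nothing, while a
# confused rare ESTJ (p_t ~ 0.1) keeps nearly its full loss.
logits = torch.randn(8, 16)
targets = torch.randint(0, 16, (8,))
print(focal_loss(logits, targets))
```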