This diagram shows the end-to-end path from input messages to final MBTI prediction, with all dimensions and internal logic labeled.
```mermaid
%%{init: {'flowchart': {'useMaxWidth': true}} }%%
graph TD
subgraph "Input & Embedding"
A["Input IDs<br/>[8, 256]"] --> B["Word Embeddings<br/>[8, 256, 768]"]
C["Positional IDs<br/>[256]"] --> D["Pos Embeddings<br/>[1, 256, 768]"]
B & D --> E["X' (Sum)<br/>[8, 256, 768]"]
end
subgraph "Multi-Head Attention (1 of 12 Layers)"
E --> F["Wq, Wk, Wv Weights<br/>[768, 768]"]
E & F --> G["Q, K, V Projections<br/>[8, 256, 768]"]
G --> H["Split into 12 Heads<br/>[8, 12, 256, 64]"]
subgraph "Inside Each Head"
H --> I["Q_head [256, 64] ×<br/>K_head_Transpose [64, 256]"]
I --> J["Scores Matrix<br/>[256, 256]"]
J --> K["Scale (/8 = /√64) &<br/>Softmax"]
K --> L["Alphas (Weights)<br/>[256, 256]"]
L --> M["Alphas × V_head<br/>[256, 64]"]
M --> N["Head Output<br/>[256, 64]"]
end
N --> O["Concat All 12 Heads<br/>[8, 256, 768]"]
O --> P["Final Linear Projection<br/>[8, 256, 768]"]
end
subgraph "Residual & FFN"
P --> Q_res["Add & LayerNorm<br/>[8, 256, 768]"]
Q_res --> R["Feed Forward Network<br/>(768 -> 3072 -> 768)"]
R --> S["Add & LayerNorm<br/>[8, 256, 768]"]
end
subgraph "Output Prediction Head"
S --> T["Mean Pooling<br/>[8, 768]"]
T --> U["Classifier (Linear Output)<br/>[8, 16]"]
U --> V_out["Softmax -> Argmax<br/>-> MBTI Type"]
end
```
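To make the shapes above concrete, here is a minimal PyTorch sketch of one attention layer plus the prediction head, using the diagram's dimensions (batch 8, sequence 256, hidden 768, 12 heads of size 64, 16 MBTI classes). Names like `MiniAttention` are illustrative, not the project's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, H = 8, 256, 768, 12   # batch, seq len, hidden size, attention heads
d_head = D // H                # 64 dims per head

class MiniAttention(nn.Module):
    """One multi-head self-attention block with the diagram's shapes."""
    def __init__(self):
        super().__init__()
        self.wq = nn.Linear(D, D)   # Wq: [768, 768]
        self.wk = nn.Linear(D, D)   # Wk: [768, 768]
        self.wv = nn.Linear(D, D)   # Wv: [768, 768]
        self.out = nn.Linear(D, D)  # final linear projection

    def forward(self, x):  # x: [8, 256, 768]
        q = self.wq(x).view(B, T, H, d_head).transpose(1, 2)  # [8, 12, 256, 64]
        k = self.wk(x).view(B, T, H, d_head).transpose(1, 2)
        v = self.wv(x).view(B, T, H, d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / d_head ** 0.5  # [8, 12, 256, 256], scale = /8
        alphas = F.softmax(scores, dim=-1)                # attention weights
        heads = alphas @ v                                # [8, 12, 256, 64]
        concat = heads.transpose(1, 2).reshape(B, T, D)   # [8, 256, 768]
        return self.out(concat)                           # [8, 256, 768]

x = torch.randn(B, T, D)             # stand-in for X' (word + positional embeddings)
h = MiniAttention()(x)
pooled = h.mean(dim=1)               # mean pooling -> [8, 768]
logits = nn.Linear(D, 16)(pooled)    # classifier -> [8, 16]
pred = logits.argmax(dim=-1)         # softmax -> argmax -> MBTI type index
```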

Traditional optimizers like SGD (Stochastic Gradient Descent) are like a ball rolling down a hill: it can get stuck in a small pothole. AdamW is like a specialized vehicle designed for rough terrain, adapting its step size per parameter and applying weight decay in a decoupled way.
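As a sketch, swapping SGD for AdamW in PyTorch is a one-line change; the learning rate and weight decay below are illustrative placeholders, not the project's tuned values.

```python
import torch

model = torch.nn.Linear(768, 16)  # stand-in for the real classifier

# AdamW: per-parameter adaptive step sizes + decoupled weight decay
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# versus plain SGD, which uses one global step size:
# optimizer = torch.optim.SGD(model.parameters(), lr=2e-5)
```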
Standard Cross-Entropy Loss treats every error the same. If the model is already 90% sure about a common ENFP but only 10% sure about a rare ESTJ, it may still spend its effort pushing the ENFP to 91% rather than fixing the ESTJ. This is bad for MBTI data, where some types are very rare.
Focal Loss fixes this with a special "multiplier", the focusing term $(1 - p_t)^\gamma$. The full loss is $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$: when the model is already confident ($p_t$ is high), the multiplier shrinks toward zero, so easy examples contribute almost nothing and training focuses on the hard, rare ones.
The $\alpha_t$ term comes from per-class weights (class_weights): rare classes like INTJ or ENTJ get a larger weight, a "bonus" that gives them a louder voice during training.
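Here is a minimal Focal Loss sketch in PyTorch combining both pieces, the $(1 - p_t)^\gamma$ focusing term and the $\alpha$ class weights; the gamma value and the weight tensor are illustrative assumptions, not the project's tuned settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, class_weights, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:        [batch, 16] raw classifier outputs
    targets:       [batch] true MBTI class indices
    class_weights: [16] alpha per class (larger for rare types)
    """
    log_probs = F.log_softmax(logits, dim=-1)                       # log p for every class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_t
    pt = log_pt.exp()                                               # p_t, true-class probability
    alpha_t = class_weights[targets]                                # per-example alpha
    loss = -alpha_t * (1 - pt) ** gamma * log_pt                    # down-weight easy examples
    return loss.mean()

# Illustrative usage: a rare type (here index 3, hypothetically) gets a larger alpha
logits = torch.randn(8, 16)
targets = torch.randint(0, 16, (8,))
class_weights = torch.ones(16)
class_weights[3] = 4.0
print(focal_loss(logits, targets, class_weights))
```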