“drop the ‘the’ just facebook it’s cleaner” — “Social Network” (link)

This Thinky blog post (https://thinkingmachines.ai/blog/on-policy-distillation/) has some sleek definitions of the different stages of LLM training, upon which I elaborate a bit more:

Pre-training:

Post-training:

⭐️Mid-training⭐️:

Now, you see there isn’t really a formal definition of pre-/mid-/post-training. The real questions are: What is the scale of the dataset? What inherent structure does the dataset have? And how can we exploit that scale and structure with data curation or learning algorithms?
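To make that concrete, here is a rough sketch (my framing in Python, not the blog’s) of the three stage labels read as answers to those three questions; the specifics are illustrative, not a recipe:

```python
# Illustrative framing only: stage names as answers to (scale, structure, exploitation).
stages = {
    "pre-training": {
        "scale": "largest (broad web-scale corpus)",
        "structure": "loose: raw text, little labeling",
        "exploitation": "next-token prediction over everything",
    },
    "mid-training": {
        "scale": "smaller, curated",
        "structure": "domain-specific corpora (e.g. code, internal docs)",
        "exploitation": "continued next-token prediction on a shifted data mixture",
    },
    "post-training": {
        "scale": "smallest",
        "structure": "prompts paired with responses, preferences, or rewards",
        "exploitation": "SFT, RL, or distillation that exploits that pairing",
    },
}
```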

The reason JEPA will not work on language is that it bakes too many assumptions about the data’s structure into the learning objective, i.e., contrastive learning of joint embeddings, and those assumptions do not generalize to internet scale.
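To see what kind of structural assumption I mean, here is a minimal, illustrative sketch of a joint-embedding objective (plain PyTorch, not JEPA’s actual implementation): every training item has to be split into paired “context” and “target” views whose embeddings are matched. That split is natural for images (crops, masked patches), but there is no equally clean recipe for arbitrary internet text.

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(context_emb: torch.Tensor,
                         target_emb: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Contrastive (InfoNCE-style) loss over a batch of paired views.

    context_emb, target_emb: (batch, dim) embeddings of two views of the
    same underlying item. Needing such paired views is the structural
    assumption discussed above; JEPA variants swap the contrastive term
    for a latent prediction loss, but still need the context/target split.
    """
    c = F.normalize(context_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = c @ t.t() / temperature                    # (batch, batch) similarity matrix
    labels = torch.arange(c.size(0), device=c.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```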