By: Thomas Stahura
O3-mini is amazing and totally free. OpenAI achieved this through distillation from the larger, yet-to-be-released o3 model.
Right now, the model ranks second globally, beating DeepSeek R1 but trailing the massive o1 (the only reasoning models topping the benchmarks this week). Estimates put o1 at 200-300 billion parameters, DeepSeek R1 at 671 billion, and o3-mini at just 3-30 billion.
What’s remarkable is that o3-mini achieves intelligence close to o1 while being just one-hundredth its size, thanks to distillation.
There are a variety of distillation techniques, but at a high level, distillation involves using a larger teacher model to teach a smaller student model.
For example, GPT-4 (a 1.4-trillion-parameter model) was trained on roughly a million gigabytes of public internet data (one petabyte). GPT-4 was trained to represent that data, to represent the internet.
The resulting 1.4 trillion parameter model, if downloaded, would occupy about 5,600 GB, or 5.6 terabytes of space (at four bytes per parameter). In a sense, you can think of GPT-4 (or any LLM) as a highly compressed, queryable representation of its training set, in this case the internet. After all, going from 1 petabyte to 5.6 terabytes is a 99.45% reduction.
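If you want to sanity-check those numbers yourself, here's a quick back-of-the-envelope calculation. It assumes the parameter and dataset estimates above and four-byte (fp32) weights; these are rough figures, not official specs.

```python
# Back-of-the-envelope check of the compression claim above.
params = 1.4e12            # estimated GPT-4 parameter count (from the text)
bytes_per_param = 4        # assuming fp32 weights
model_bytes = params * bytes_per_param

training_set_bytes = 1e15  # ~1 petabyte of training data (estimate from the text)

print(f"Model size: {model_bytes / 1e12:.1f} TB")         # ~5.6 TB
reduction = 1 - model_bytes / training_set_bytes
print(f"Reduction vs. training set: {reduction:.2%}")      # ~99.44%
```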
So, how does this apply to distillation? If you think about a model as a compressed version of its training dataset, then you can “uncompress” that dataset by querying the larger teacher model, in this case GPT-4. Query it until you’ve generated something like 1 petabyte of synthetic data, then use that dataset to train or fine-tune a smaller student model (3-10 billion parameters) to mimic the larger teacher model’s performance.
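Here's a minimal sketch of that teacher-student loop using Hugging Face transformers. The model names (gpt2-xl as the teacher, gpt2 as the student) and the two prompts are placeholders just to show the shape of the process; a real distillation run would use a far larger prompt set and proper training infrastructure.

```python
# Sketch of distillation via synthetic data: sample completions from a larger
# "teacher" model, then fine-tune a smaller "student" on that text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "gpt2-xl"   # stand-in for the large teacher model
student_name = "gpt2"      # stand-in for the small student model

teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

prompts = ["Explain photosynthesis.", "What is a binary search tree?"]

# 1. "Uncompress" the teacher: generate synthetic training text from prompts.
synthetic_texts = []
for p in prompts:
    inputs = teacher_tok(p, return_tensors="pt")
    out = teacher.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
    synthetic_texts.append(teacher_tok.decode(out[0], skip_special_tokens=True))

# 2. Fine-tune the student on the synthetic text with a plain next-token loss.
student_tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

student.train()
for text in synthetic_texts:
    batch = student_tok(text, return_tensors="pt", truncation=True, max_length=512)
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```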
This remains an active area of research today.
Of course, distilling from a closed-source model is strictly against OpenAI’s terms of service. That didn’t stop DeepSeek, though, which is currently being probed by Microsoft over allegations that it trained on synthetic data from OpenAI’s models.
The cat’s out of the bag. OpenAI themselves distilled o3-mini from o3, and Microsoft distilled phi-3.5-mini-instruct from phi-3.5. It seems like from now on, whatever model performs best will become the “teacher” for all the “student” models, which will be fine-tuned to quickly catch up to it in performance. This new paradigm has shifted the AI industry's focus from LLMs to AI applications, the main one being agents.
OpenAI (in addition to launching o3-mini) debuted a new web agent called deep research (only available at the $200/month tier). I’ve used many web agents and browser tools like Browserbase, browser-use, and computer-use. I have buddies who are building CopyCat (YC W25), and I’ve even built my own browser agent. All this to say the AI application space is heating up!
Stay tuned because I’ll talk more about agents next week!
P.S. If you have any questions or just want to talk about AI, email me: thomas @ ascend dot vc