By: Thomas Stahura
Reasoning models are branded as the next evolution of large language models (LLMs). And for good reason.
These models, like OpenAI’s o3 and DeepSeek’s R1 (from the High-Flyer-backed lab), rely on test-time compute. Essentially, they think before speaking, writing out their train of thought before producing a final answer.
Reasoning models are showing terrific benchmark improvements! AI researchers (and the public at large) demand better-performing models, and there are five levers for delivering them: data, training, scale, architecture, and inference. At this point, almost all public internet data has been exhausted, models have been trained at every size and scale, and transformers have dominated most architectures since 2017. That leaves inference, which, for the time being, seems to be what is improving AI test scores.
OpenAI’s o3 nails an 87% on GPQA-D and achieves 75.5% on the ARC Prize (at a $10,000 compute limit). However, the true costs remain (as of January 2025) a topic of much discussion and speculation. Threads on OpenAI’s developer forum suggest roughly $60 per query for o3-mini and $600 for o3. Seems fair; however, whatever the costs are at the moment, OpenAI’s techniques will likely be revealed, fueling competition and eventually lowering costs for all.
One question still lingers: How exactly did OpenAI make o3?
There exists no dataset on the internet of questions, logically sound reasoning steps, and correct answers. (Okay, maybe Chegg, but it might be going out of business.) Anyway, much of the training data is theorized to be synthetic.
STaR (Self-Taught Reasoner) is a research paper that proposes a technique for turning a regular LLM into a reasoning model. The paper calls for using an LLM to generate a dataset of rationales, then fine-tuning that same LLM on the dataset so it becomes a reasoning model. STaR relies on a simple loop to build the dataset: generate rationales to answer many questions; if a generated answer is wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; and repeat.
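The loop above can be sketched in a few lines of Python. To keep it self-contained and runnable, the functions `generate_rationale` and `fine_tune` below are hypothetical stand-ins for real LLM sampling and fine-tuning calls, and the "model" is just a toy set of memorized questions; this is a sketch of the control flow, not the paper's actual implementation.

```python
def star_loop(generate_rationale, fine_tune, dataset, iterations=3):
    """Minimal STaR-style loop.

    dataset: list of (question, correct_answer) pairs.
    generate_rationale(state, question, hint) -> (rationale, answer);
        when `hint` is given, the model sees the correct answer
        (the paper's "rationalization" step).
    fine_tune(state, triples) -> new model state.
    """
    model_state = None  # opaque handle to the (toy) model
    for _ in range(iterations):
        collected = []  # (question, rationale, answer) triples to train on
        for question, correct in dataset:
            # Step 1: attempt to reason to an answer unaided.
            rationale, answer = generate_rationale(model_state, question, hint=None)
            if answer != correct:
                # Step 2: retry with the correct answer as a hint,
                # keeping only the rationale it produces.
                rationale, answer = generate_rationale(
                    model_state, question, hint=correct)
            if answer == correct:
                # Step 3: keep only rationales that reached the right answer.
                collected.append((question, rationale, answer))
        # Step 4: fine-tune on the filtered rationales, then repeat.
        model_state = fine_tune(model_state, collected)
    return model_state


# --- Toy stand-ins so the loop runs end to end ---

def toy_generate(state, question, hint):
    known = state or set()
    if hint is not None:
        # With the answer hinted, the toy "model" always rationalizes correctly.
        return f"because the answer is {hint}", hint
    if question in known:
        return "recalled from fine-tuning", question.upper()
    return "unaided guess", "?"


def toy_fine_tune(state, collected):
    # "Fine-tuning" here just memorizes which questions had good rationales.
    return {q for q, _, _ in collected}


if __name__ == "__main__":
    data = [("a", "A"), ("b", "B")]
    final_state = star_loop(toy_generate, toy_fine_tune, data)
    print(final_state)
```

In the first pass every unaided answer is wrong, so the loop falls back to hinted rationalization; by the second pass the "fine-tuned" toy model answers unaided, mirroring how STaR bootstraps reasoning from its own filtered outputs.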
It's now 2025, and the AI world moves FAST. Many in the research community believe the future lies in models that can think outside of language. This is cutting-edge research as of today.
I plan to cover more as these papers progress, so stay tuned!