By: Thomas Stahura
Sam Altman’s manifest destiny is clear: achieve AGI.
There is little consensus on what AGI actually means. Altman defines it as “the equivalent of a median human that you could hire as a coworker and they could do anything that you’d be happy with a remote coworker doing.”
Dario Amodei, Anthropic co-founder and CEO, says AGI happens “when we are at the point where we have an AI model that can do everything a human can do at the level of a Nobel laureate across many fields.”
Demis Hassabis, CEO of Google DeepMind, puts it more succinctly. AGI, he says, is “a system that can exhibit all the cognitive capabilities humans can.”
If AGI is inevitable, the next debate is over timing. Altman thinks this year. Amodei says within two. Hassabis sees it arriving sometime this decade.
As I mentioned last week, AI researchers are working to unify multiple modalities — text, audio, and images — into a single model. These so-called “omni” models can natively generate and understand all three. GPT-4o is one of them; the “o” stands for “omni.” It has handled both text and speech for nearly a year. But image generation was still ruled by diffusion models, until last week.
It began with a research paper from a year ago out of Peking University and ByteDance, which introduced Visual AutoRegressive modeling, or VAR. Instead of predicting an image patch by patch in raster order, VAR uses coarse-to-fine “next-scale prediction”: it generates a low-resolution base image first, then predicts the details at each progressively higher resolution, which improves both speed and quality over conventional GPT-style raster-scan or diffusion denoising methods.
Put simply, VAR enables GPT-style models to overtake diffusion for image generation at large scales.
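To make the idea concrete, here is a minimal toy sketch of coarse-to-fine next-scale generation in Python. It is not the paper’s implementation, and the names (`upsample`, `predict_residual`, `generate`) are my own: the `predict_residual` stub stands in for VAR’s transformer, which in the real model predicts discrete token maps conditioned on all coarser scales.

```python
import numpy as np

def upsample(img, size):
    """Nearest-neighbor upsample of a square (H, H) array to (size, size)."""
    reps = size // img.shape[0]
    return np.repeat(np.repeat(img, reps, axis=0), reps, axis=1)

def predict_residual(coarse, size, rng):
    """Stub for the autoregressive step. A real VAR transformer would predict
    a discrete token map conditioned on `coarse` and all earlier scales;
    here we just sample small random detail so the sketch runs end to end."""
    return rng.normal(scale=1.0 / size, size=(size, size))

def generate(scales=(1, 2, 4, 8, 16), seed=0):
    """Coarse-to-fine generation: start tiny, then refine scale by scale."""
    rng = np.random.default_rng(seed)
    img = rng.normal(size=(scales[0], scales[0]))  # low-resolution base image
    for size in scales[1:]:
        base = upsample(img, size)                      # carry coarse structure up
        img = base + predict_residual(base, size, rng)  # add finer detail
    return img

print(generate().shape)  # (16, 16)
```

The key contrast with raster-scan autoregression is that each step emits an entire scale at once, so the number of sequential steps grows with the number of scales rather than the number of pixels.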
Qwen2.5-Omni, the open-source omni model from China I referenced last week, may be an early sign of where things are heading. In the research paper, the authors wrote, “We believe Qwen2.5-Omni represents a significant advancement toward artificial general intelligence (AGI).”
Is omni a leap toward AGI? That’s the bet labs are making.
And generative-model-native startups will need to respond. Companies like Midjourney and Stability, still rooted in diffusion, will likely have to build their own GPT-style image generators to compete. Not just for images, but potentially across all modalities. The same pressure may extend to music and video, pushing startups like Suno, Udio, Runway, and Pika to expand beyond their core businesses. This shift will play out over years, not months, especially for video. Regardless, I'm certain researchers at OpenAI, Anthropic, Google, and Microsoft are actively training their next-gen omni models.
OpenAI has a lot riding on AGI. If it gets there first, Microsoft loses access to OpenAI’s most advanced models.
Tensions between the two have been building for months. The strain began last fall, when Mustafa Suleyman, Microsoft’s head of AI, was reportedly “peeved that OpenAI wasn’t providing Microsoft with documentation about how it had programmed o1 to think about users’ queries before answering them.”
The frustration deepened when Microsoft found more value in the free DeepSeek model than in its $14 billion investment in OpenAI.
Microsoft is already developing its own foundation model, MAI, which is rumored to match OpenAI’s performance. OpenAI, meanwhile, just closed a $40 billion funding round on the strength of GPT-4o and its new image generator, an update more significant than most realize.
From the outside, it appears AGI is near. Granted, I suspect we won’t feel the full impact until the 2030s. My own working definition: a model capable of performing all economically valuable work on a computer, across all domains.
What that means for the labor market is another story. Stay tuned!