By: Thomas Stahura
The race to dominate artificial intelligence is accelerating on every front, as research labs across the globe push full throttle on new model releases while governments move to cement AI supremacy.
In the past few weeks, Google released two major models, OpenAI launched long-awaited image capabilities, and Chinese labs pushed open-source systems that rival the best from the West. What began as a battle between private research labs is now a global competition shaped by open models, national strategies, and shifting power dynamics.
Here's a breakdown of what just happened:
Google announced Gemma 3, the latest model in its Gemma series. At around 27 billion parameters, I wouldn’t call the largest version “small,” yet it punches above its weight class. It’s also one of the few open models that can take video as input. A few days later, Mistral open-sourced Mistral Small 3.1, a 24-billion-parameter model that outperforms Gemma 3 on most benchmarks.
But the bigger news is Gemini 2.0 Flash Experimental, Google’s new closed-source flagship and the company’s first unified multimodal model. That means a single model can both understand and generate images and text. I’ve been playing around with it. It can edit images from simple text prompts, generate each frame of a GIF, and even compose a story complete with illustrations. (This is similar to Seattle startup 7Dof, which showcased a visual chain-of-thought editing tool at South Park Commons last year.)
Traditionally, transformer models have been used to generate text, while diffusion models have been used to generate images. Today, researchers are experimenting with unifying both architectures in a single model (similar to what is happening with vision-language-action, or VLA, models in robotics). The ultimate goal is a model that unifies the text, image, and audio spaces.
GPT-4o has had image-generation abilities for a while. Greg Brockman demoed GPT-4o generating images last May, and this week the company finally launched the capability.
At this point in the AI race, OpenAI seems to be reacting more than leading. Launching GPT-4o’s image generation looked like a response to Gemini 2.0 Flash Experimental.
Trump has said multiple times that he wants “American AI Dominance.” To that end, the White House invited public comment on its AI Action Plan. OpenAI published its response, slamming DeepSeek and urging the administration to implement the following:
An export control strategy that exports democratic AI
A copyright strategy that promotes the freedom to learn
A strategy to seize the infrastructure opportunity to drive growth
An ambitious government adoption strategy
Google also responded, urging America to:
Invest in AI
Accelerate and modernize government AI adoption
Promote pro-innovation approaches internationally
China has its own plan.
Dubbed the “New Generation Artificial Intelligence Development Plan” and released in 2017, the agenda aims to make China the global leader in AI by 2030. For U.S. labs, the worry seems to be the sheer quality and openness of the models coming out of China today. It’s hard to name a model from a Chinese AI lab that isn’t open source.
Over the course of a week earlier this month, DeepSeek open-sourced the technical details used to build its R1 and V3 models, with one exception: the actual dataset used to train them (which adds to the suspicion that DeepSeek trained on GPT-4o outputs).
DeepSeek also open-sourced Janus-Pro. Though the model got significantly less attention than its big brother, Janus-Pro is a unified multimodal model (like Gemini 2.0 Flash Experimental) that can both understand and generate images and text, making it one of the first open-source models of its kind.
Qwen, the AI lab out of Alibaba Cloud, has launched its own reasoning model, QwQ-32B, which matches DeepSeek R1’s performance on many benchmarks. The model already has 615k downloads on Hugging Face.
OpenBMB (Open Lab for Big Model Base) is a Chinese AI research group out of Tsinghua University. The group is best known for MiniCPM-o-2_6, a unified multimodal model that can understand images, text, and speech and can generate text and speech. According to its benchmarks, the model performs at GPT-4o levels, and it has 766k downloads.
DeepSeek V3.1 also launched this week. The model leapfrogged Grok 3 and Claude 3.7 to become the best-performing non-reasoning model, the first time an open-source model has held the top spot.
That is, until Google’s Gemini 2.5 Pro Experimental dropped a few hours later. More on that next week.
Ok, here’s my take on the flood of releases:
This is good news for startups, full stop. More models mean more competition, and more competition means lower prices. Even if the U.S. bans Chinese models, most are fully open, so developers can fine-tune them and build whatever they need.
The real challenge now is the viability of America’s top AI labs. If China can flood the market with cheap, open, high-quality models, Chinese labs could undercut their U.S. counterparts. It’s a familiar playbook, one China has used before in other industries. This time, it’s electrons instead of atoms, and that shift might tilt the board in China’s favor.
Only time will tell, so stay tuned!