By: Thomas Stahura
Escaping a rogue self-driving Tesla is simple: climb a flight of stairs.
While a Model Y can’t climb stairs, Tesla’s new humanoid surely can. If Elon Musk and the Tesla bulls have their way, humanoids could outnumber humans by 2040. That means there’s quite literally nowhere left to hide — the robot revolution is upon us.
Of course, Musk isn’t alone in building humanoids. Boston Dynamics has spent decades stunning the internet with robot acrobatics and dancing. For $74,500, you can own Spot, its robot dog. Agility Robotics in Oregon and Sanctuary AI in British Columbia are designing humanoids for industrial labor, not the home. China’s Unitree Robotics is selling a $16,000 humanoid today.
These machines may feel like a sudden leap into the future, but the idea of humanoid robots has been with us for centuries. Long before LLMs and other abstract technologies, robots were ingrained in culture, mythology, and our collective engineering dreams.
Around 1200 BCE, the ancient Greeks told stories of Talos, a towering bronze guardian patrolling Crete. During the Renaissance, Leonardo da Vinci sketched his mechanical knight. The word “robot” itself arrived in 1920 with Karel Čapek’s play R.U.R. (Rossum’s Universal Robots). By 1962, The Jetsons brought Rosie the Robot into American homes. And in 1973, Japan’s Waseda University introduced WABOT-1, the first full-scale — if clunky — humanoid robot.
Before the advent of LLMs, the vision was to create machines that mirror the form and function of a human being. Now it seems the consensus is to build a body for these models. Or rather, to build models for these bodies.
They're calling it a vision-language-action (VLA) model, and it's a new architecture purpose-built for general robot control. Currently, two types of model architectures dominate the market: transformers and diffusion models. Transformers process and predict sequential data (think text generation), while diffusion models generate continuous data through an iterative denoising process (think image generation).
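To make that distinction concrete, here's a toy sketch in Python. The "models" are random stand-ins, not trained networks; the point is the shape of each loop: one emits discrete tokens sequentially, the other refines a continuous sample out of noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transformer-style generation: predict the next element one step at a time.
# A real model would condition on the context; this stand-in just samples.
def next_token(context):
    logits = rng.normal(size=16)       # stand-in for a learned network's output
    return int(np.argmax(logits))

tokens = [1]                           # start-of-sequence token
for _ in range(5):
    tokens.append(next_token(tokens))  # sequential, discrete output

# Diffusion-style generation: start from pure noise and iteratively denoise.
# A real model predicts the noise to remove; this just nudges toward a target.
target = np.array([0.5, -0.2])         # stand-in for the "clean" sample
x = rng.normal(size=2)                 # start from noise
for _ in range(50):
    x += 0.1 * (target - x)            # one continuous refinement step

print(tokens, x.round(2))
```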
VLA models (like π0) combine elements from both approaches to address the challenges of robotic control in the real world. These hybrid architectures enable robots to translate visual observations (from cameras) and language instructions (the robot's given task) into precise physical actions, using the sequential reasoning of transformers and the continuous outputs of diffusion models. Other frontier VLA model startups include Skild (reportedly in talks to raise $500 million at a $4 billion valuation), Hillbot, and Covariant.
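Here's a minimal sketch of that hybrid in PyTorch: a transformer backbone fuses image and instruction tokens, then a diffusion-style head refines a chunk of continuous actions out of noise. Every module name, shape, and step count below is an illustrative assumption, not π0's actual design.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    def __init__(self, dim=64, action_dim=7, horizon=8):
        super().__init__()
        self.vision = nn.Linear(3 * 32 * 32, dim)     # stand-in image encoder
        self.language = nn.Embedding(1000, dim)       # stand-in text embedding
        self.backbone = nn.TransformerEncoder(        # sequential reasoning
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Diffusion-style action head: predicts a denoising direction for a
        # chunk of continuous actions, conditioned on the backbone's context.
        self.action_head = nn.Linear(dim + action_dim, action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, image, instruction_ids, denoise_steps=10):
        img_tok = self.vision(image.flatten(1)).unsqueeze(1)
        txt_tok = self.language(instruction_ids)
        ctx = self.backbone(torch.cat([img_tok, txt_tok], dim=1)).mean(dim=1)
        # Start the action chunk from noise and iteratively refine it.
        actions = torch.randn(image.shape[0], self.horizon, self.action_dim)
        for _ in range(denoise_steps):
            cond = ctx.unsqueeze(1).expand(-1, self.horizon, -1)
            delta = self.action_head(torch.cat([cond, actions], dim=-1))
            actions = actions - 0.1 * delta           # one denoising step
        return actions                                # (batch, horizon, 7)

model = ToyVLA()
acts = model(torch.rand(1, 3, 32, 32), torch.tensor([[4, 7, 1]]))
print(acts.shape)  # torch.Size([1, 8, 7])
```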
A new architecture means a new training paradigm. Lucky Robots (an Ascend portfolio company) is pioneering synthetic data generation for VLA models by having robots learn in a physics simulation, enabling developers to play with these models without needing a real robot. Nvidia is cooking up something similar with its Omniverse platform.
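In spirit, the data-collection loop looks something like the sketch below. The ToySim class is a made-up stand-in; Lucky Robots and Omniverse each expose their own, far richer simulation APIs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in physics simulator: just random-walks a 7-joint state and
# renders a fake camera frame.
class ToySim:
    def reset(self):
        self.state = rng.normal(size=7)               # 7 joint angles
        return self.state.copy()

    def step(self, action):
        self.state += 0.05 * action + rng.normal(scale=0.01, size=7)
        return self.state.copy()

    def render(self):
        return rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Collect (image, instruction, action) triples: exactly the supervision a
# VLA model needs, generated without touching a physical robot.
sim, dataset = ToySim(), []
for episode in range(3):
    obs = sim.reset()
    for t in range(20):
        action = -0.1 * obs                           # scripted "expert" policy
        dataset.append((sim.render(), "pick up the cube", action))
        obs = sim.step(action)

print(len(dataset), "training examples")              # 60
```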
Some believe that more data and better models will lead to an inflection point in robotics, similar to what happened with large language models. However, unlike text and images, physical robotics data cannot be scraped from the web; it must either be collected by an actual robot or synthesized in a simulation. And regardless of how the model is trained, a real robot is still needed to act upon the world.
At the very least, it's far from a solved problem. Since a robot can have any combination of cameras, joints, and motors, building a single unified model that can inhabit every robot is extremely challenging. Figure AI (valued at $2.6 billion, with OpenAI among its investors) recently dropped OpenAI's models in favor of in-house ones. It's not alone: so many VLA models are being uploaded to Hugging Face that the platform had to add a new model category just to keep up.
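One common workaround for the mixed-embodiment problem, sketched below with made-up robot specs, is to pad every robot's action vector to a shared width and mask the unused slots, so a single model can emit actions for many different bodies.

```python
import numpy as np

# Hypothetical fleet: robot name -> degrees of freedom. Illustrative only.
ROBOTS = {"arm_6dof": 6, "humanoid": 23, "gripper_cart": 4}
MAX_DIM = max(ROBOTS.values())

def pad_action(robot, action):
    dof = ROBOTS[robot]
    assert action.shape == (dof,)
    padded = np.zeros(MAX_DIM)                # shared action width
    mask = np.zeros(MAX_DIM, dtype=bool)      # which slots are real
    padded[:dof], mask[:dof] = action, True
    return padded, mask                       # model trains on (padded, mask)

padded, mask = pad_action("arm_6dof", np.random.default_rng(0).normal(size=6))
print(padded.shape, mask.sum())               # (23,) 6
```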
The step from concept to reality has been a long one for humanoid robots, but the pace of progress suggests we're just getting started.
P.S. If you have any questions or just want to talk about AI, email me! thomas@ascend.vc