Why AI Needs Two Billion Frames to Understand a Single Step
On scale, generalisation, and what it means when a machine finally "gets" movement
A human infant takes about a year to learn to walk. In that year, on a rough estimate, they generate a few million instances of weight shift, balance correction, and motor adjustment (trying, falling, recovering, trying again), each one updating the nervous system's model of how this particular body navigates gravity.
A new AI motion model called Humanoid-GPT, presented this week at CVPR 2026, was trained on two billion frames of human movement data before it could reliably track and follow unfamiliar motion. Two billion, compared to a baby's few million. And unlike the baby, the AI model does not have a body. It has never fallen. It cannot feel what it is like to catch your balance at the last moment.
Why does it need so much more data? And what does "understanding movement" even mean for a system that has no experience of moving?
What Scale Does
Humanoid-GPT is a GPT-style model — the same kind of architecture that powers large language models — but trained on movement rather than text. Instead of predicting the next word in a sentence, it predicts the next frame of a motion sequence. Instead of learning grammar and meaning from billions of sentences, it learns the grammar and physics of human movement from billions of motion frames.
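The next-frame idea can be made concrete with a toy sketch. This is not Humanoid-GPT's architecture or code: the `Frame` layout, the `rollout` helper, and the `constant_velocity_predictor` stand-in are all hypothetical, chosen only to show the autoregressive loop, in which each predicted frame is appended to the sequence and becomes context for the next prediction. A learned model would take the place of the simple extrapolation rule.

```python
from typing import Callable, List

# Hypothetical layout: one frame is a flat list of joint values for a single pose.
Frame = List[float]


def constant_velocity_predictor(context: List[Frame]) -> Frame:
    """Stand-in for a learned model: extrapolate from the last two frames."""
    if len(context) < 2:
        return list(context[-1])
    prev, last = context[-2], context[-1]
    return [b + (b - a) for a, b in zip(prev, last)]


def rollout(
    predict: Callable[[List[Frame]], Frame],
    seed: List[Frame],
    steps: int,
    window: int = 8,
) -> List[Frame]:
    """Autoregressive generation: feed each prediction back in as context."""
    frames = [list(f) for f in seed]
    for _ in range(steps):
        # Only the most recent `window` frames are visible to the predictor,
        # much as a language model sees a bounded context of tokens.
        frames.append(predict(frames[-window:]))
    return frames


# Two seed frames of a two-joint "body" (assumed values), then three predicted frames.
seed = [[0.0, 1.0], [0.1, 0.9]]
generated = rollout(constant_velocity_predictor, seed, steps=3)
```

The loop has the same shape as text generation: swap the list of joint values for a token and the extrapolation rule for a trained network, and it is the familiar next-word procedure.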
The result, according to the researchers, is a model that can zero-shot track unfamiliar movement — meaning it can follow and respond to a movement sequence it has never seen before, without any additional training. Show it a martial arts kata it was never trained on, and it tracks it. Show it a dance phrase from a style absent from its training data, and it follows the phrasing, the transitions, the dynamic qualities.
This is what scale buys: generalisation. A small model trained on limited data learns to handle the specific cases it saw during training, but fails on novel inputs. A large model trained on diverse data internalises something more general — the underlying structure of how human bodies move through space and time, the constraints that all movement obeys, the patterns that recur across movement styles and bodies.
The Difference Between Tracking and Understanding
There is an important distinction here, though, between tracking movement and understanding movement in the sense that a somatic practitioner means.
Humanoid-GPT is a tracking model: given a sequence of movement, predict what comes next and follow it. It is very good at this, at a scale that was not achievable before. What it is not doing is experiencing the movement from the inside — feeling the weight shift, the muscular engagement, the quality of the breath at the top of the arc before descent.
Think of the difference between a master class teacher who watches a student move and corrects them from outside observation, versus a practice partner who physically leads and follows you and feels the moment of resistance or release through shared contact. Both are engaged with the movement; both are responding to what they perceive. But they are perceiving different things, and their responses emerge from different kinds of knowing.
Humanoid-GPT is the master class teacher: extraordinarily knowledgeable, watching from the outside, able to follow and predict what the body will do next. The nervous system of a trained mover is more like the practice partner, knowing from the inside, from the felt quality of the exchange.
Neither mode of knowing is more real or valuable than the other. But they are different, and conflating them leads to misunderstanding what AI motion systems can and cannot do.
Why Generalisation Matters Anyway
The zero-shot generalisation result still matters enormously, even if it is not the same as embodied understanding. Here is why.
Before models like Humanoid-GPT, each new movement domain typically required its own model. If you wanted an AI system that could respond to contemporary dance, you trained a model on contemporary dance data. If your practitioner improvised in a way that departed significantly from the training distribution, the model would fail, generating responses that were generic, laggy, or simply wrong.
Zero-shot generalisation changes this. A model that has absorbed enough of the underlying structure of human movement to follow unfamiliar motion sequences without additional training can be deployed in a co-creative context without extensive per-domain data collection. It can encounter the practitioner's idiosyncratic vocabulary — their habitual spirals, their signature timing, their personal relationship with gravity — and track it, without needing to have been specifically trained on it.
This is not the same as feeling it. But it is the prerequisite for responding to it. A system that fails to track unfamiliar movement cannot be a creative partner. A system that tracks it reliably opens the door to the next question: what kind of response does it generate? And that is where the somatic intelligence of the practitioner becomes the essential ingredient — not replaced by the AI, but in dialogue with it.
What the Baby Knows That the Model Does Not
Back to the infant. The infant learns from a few million movement attempts, hundreds of times fewer than the two billion frames the model processes, and yet within a year is walking with proprioceptive precision that current AI systems cannot match. Why?
Because the infant is learning from the inside. Every fall is felt. Every correction is registered as a change in the felt relationship between intention and outcome. The infant's nervous system is building a forward model — a prediction of what movement will feel like — from first-person proprioceptive experience, not from third-person observation of other bodies moving.
The AI model is learning from the outside. It is learning what movement looks like, from the perspective of a sensor recording position data across time. It is not learning what movement feels like, because it has no felt dimension.
This is the gap that wearable sensing research is trying to close. EMG (electromyography) sensors and inertial measurement units (IMUs) can capture some of the muscular and kinetic signals that are invisible to cameras: the muscular pre-activation before a movement begins, the acceleration patterns that correspond to effort quality, the electrical activity in the muscles that carries intention before it is expressed as visible action. Systems trained on these signals are closer to the infant's inside-out learning than a model trained on video.
Not identical — there is still no felt experience, no nervous system that experiences the proprioceptive consequence of its own actions. But closer. And in a field moving as fast as this one, closer matters.
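To make "pre-activation" concrete, here is a minimal sketch of how one might measure the lead between muscle activity and visible motion. Everything in it is an assumption for illustration: the signals are synthetic rather than recorded, the thresholds and the 1000 Hz sample rate are arbitrary, and `onset_index` is a hypothetical helper. Real pipelines typically rectify and filter EMG before any thresholding.

```python
from typing import List, Optional

SAMPLE_RATE_HZ = 1000  # assumed: both sensors sampled at the same rate


def onset_index(signal: List[float], threshold: float) -> Optional[int]:
    """Return the index of the first sample whose magnitude exceeds threshold."""
    for i, value in enumerate(signal):
        if abs(value) > threshold:
            return i
    return None


# Synthetic signals (not measurements): muscle activity rises at sample 100,
# while limb acceleration only becomes visible at sample 150.
emg = [0.01] * 100 + [0.5] * 200
accel = [0.0] * 150 + [1.2] * 150

emg_onset = onset_index(emg, threshold=0.1)
motion_onset = onset_index(accel, threshold=0.2)

# Time by which the muscle signal leads the motion a camera could see.
lead_ms = (motion_onset - emg_onset) * 1000 / SAMPLE_RATE_HZ
```

In this toy data the muscle signal leads the motion by 50 ms. That gap is the window in which a wearable-driven system has information a camera-driven one does not.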
Further Reading
Qi, Z., et al. (2026). Humanoid-GPT: Scaling data and structure for zero-shot motion tracking. arXiv:2606.03985. https://arxiv.org/abs/2606.03985
Fuchs, T. (2018). Ecology of the brain: The phenomenology and biology of the embodied mind. Oxford University Press.