The Word, the Muscle, and the Missing Middle

Why AI struggles to turn "dance joyfully" into actual movement — and what the gap reveals about how bodies really work.


Tell a person to "reach up as if catching a falling leaf," and something remarkable happens. Without any conscious calculation, their whole body organises itself: the weight shifts, the spine lengthens, the arm rises with a particular delicacy, the hand softens in anticipation of the imagined leaf. From a handful of words, a fully organised, quality-laden movement emerges — instantly, effortlessly, without the person ever thinking about which muscle to fire.

Now ask an AI to do the same thing, and you run straight into one of the field's hardest problems. Between the words and the movement lies a gap so wide that bridging it directly turns out to be nearly impossible. Understanding why reveals something deep about how movement actually works, in bodies and, increasingly, in machines.


The Size of the Gap

Think about what actually separates a spoken instruction from a physical movement.

At the top: language. "Reach up as if catching a falling leaf." Abstract, symbolic, compressed. A few words standing in for a whole world of meaning.

At the bottom: the physical commands that actually move a body. In a real person, these are the electrical signals firing hundreds of individual muscles in precisely timed patterns. In a simulated body, they are the forces and torques applied at each joint, recalculated dozens or even hundreds of times per second.
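To make the bottom level concrete, here is a minimal sketch of what "a torque recalculated every tick" can look like for a single simulated joint. It uses a proportional-derivative (PD) rule, a common textbook technique that this article does not attribute to any particular system. The names Joint, pd_torque, step and simulate, the gains, the unit inertia and the 60-ticks-per-second rate are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Joint:
    angle: float     # radians
    velocity: float  # radians per second


def pd_torque(joint, target_angle, kp=60.0, kd=8.0):
    """Torque that pulls the joint toward target_angle, damped by its velocity.

    kp and kd are illustrative gains, not values from any real system.
    """
    return kp * (target_angle - joint.angle) - kd * joint.velocity


def step(joint, torque, inertia=1.0, dt=1.0 / 60.0):
    """Advance the joint by one control tick (semi-implicit Euler)."""
    velocity = joint.velocity + (torque / inertia) * dt
    angle = joint.angle + velocity * dt
    return Joint(angle, velocity)


def simulate(target_angle, seconds=3.0, dt=1.0 / 60.0):
    """Recompute the torque on every tick and return the final joint state."""
    joint = Joint(0.0, 0.0)
    ticks = int(round(seconds / dt))
    for _ in range(ticks):
        joint = step(joint, pd_torque(joint, target_angle), dt=dt)
    return joint, ticks


final_joint, tick_count = simulate(target_angle=1.0)
print(f"{tick_count} torque updates, final angle {final_joint.angle:.4f} rad")
```

Even this toy needs a fresh torque value on every tick for one joint. A full simulated body repeats the same calculation across every joint at once, which is where the sheer volume of low-level commands comes from.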

The distance between these two levels is staggering. A short phrase has to become thousands of precisely coordinated micro-commands. And there is no obvious direct route from one to the other. How do you get from the idea of catching a leaf to the exact force the shoulder muscle should apply in the next fraction of a second?

For years, AI systems tried to bridge this gap in one of two clumsy ways. Some generated the movement's appearance first, then handed it to a separate system to figure out the physics — but the two systems spoke different languages and the handoff was brittle. Others tried to learn the whole leap in one go, mapping words directly to muscle-level commands — but the gap was so wide that learning across it was unstable and unreliable. Neither worked well.


The Discovery of the Missing Middle

This year, several research teams converged on the same insight, and it is one that movement practitioners will find deeply familiar. The reason you can't jump directly from word to muscle is that bodies don't work that way either. There is a middle layer, and it is the key to everything.

A new system called MIND (from researchers at ShanghaiTech) named this middle layer intent. The insight is subtle but profound: the state of a body in motion — its posture, its momentum, the shape of what it's doing — is much closer in meaning to a spoken description than the raw muscle commands are. "Catching a falling leaf" doesn't map onto forces and torques, but it does map onto a recognisable state of the reaching body: the lifted, softened, anticipatory shape. And once you have that intended state, the specific muscle commands to achieve it become a much more tractable problem.

So the architecture that works is not word → muscle. It is word → intent → muscle. The intent — the felt, mid-level sense of what the body is about to do — is the bridge. Skip it, and the gap is unbridgeable. Include it, and movement becomes generable.
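The shape of that pipeline can be sketched in a few lines. This is a toy illustration of the word → intent → muscle structure only, not MIND's method: the lookup table stands in for a learned language model, the "intent" is reduced to target joint angles, and the final stage is a simple proportional rule. Every name and number here (INTENT_TABLE, text_to_intent, intent_to_commands, generate, the angles, the gain) is a hypothetical chosen for illustration.

```python
# Hypothetical lookup standing in for a learned language-to-intent model.
# Values are invented target joint angles in radians.
INTENT_TABLE = {
    "reach up": {"shoulder": 2.6, "elbow": 0.2},
    "rest": {"shoulder": 0.0, "elbow": 0.0},
}


def text_to_intent(instruction):
    """Stage 1: map words to an intended body state (target joint angles)."""
    lowered = instruction.lower()
    for phrase, target in INTENT_TABLE.items():
        if phrase in lowered:
            return dict(target)
    raise KeyError(f"no intent known for: {instruction!r}")


def intent_to_commands(intent, state, gain=40.0):
    """Stage 2: turn the intended state and the current state into torques."""
    return {name: gain * (intent[name] - state[name]) for name in intent}


def generate(instruction, state):
    """Full pipeline: word -> intent -> muscle-level command."""
    return intent_to_commands(text_to_intent(instruction), state)


current_state = {"shoulder": 0.0, "elbow": 0.0}
commands = generate("Reach up as if catching a falling leaf", current_state)
print(commands)
```

The design point is that neither stage has to span the whole gap. The first stage only relates language to a body state, and the second only relates a body state to forces, so each can be built or learned on its own terms.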


Why a Dancer Already Knew This

If this sounds obvious to you, there's a good chance you have a movement practice — because this is exactly how skilled movement is organised in the body, and it is exactly what somatic training cultivates.

A trained mover does not execute a phrase by consciously commanding individual muscles. Nor do they leap from an abstract idea straight to physical action without anything in between. They work through intention — the felt, anticipatory sense of the movement they are about to make, which then organises the muscular execution below conscious awareness. The philosopher Maurice Merleau-Ponty called this "motor intentionality": the body's own way of reaching toward a movement possibility, below the level of explicit thought and above the level of individual muscle control.

The whole art of much somatic practice lives in this middle layer. When a teacher says "initiate the movement from your centre" or "let the gesture arrive before you complete it," they are working with intent — shaping the intermediate layer where a movement is organised before it is muscularly executed. Practitioners spend years refining their sensitivity to and control over this exact layer that AI has just discovered it cannot do without.

There is something quietly striking in this convergence. Engineers, working purely from the practical problem of making AI movement work, arrived independently at the same three-layer structure (abstract idea, mid-level intent, physical execution) that phenomenologists described from lived experience and that somatic practitioners have trained within for a century. When two completely different investigations arrive at the same structure, it is usually because the structure is real.


What This Means Going Forward

The discovery of the "missing middle" points somatic-AI collaboration toward its most promising territory.

If intent is the layer where movement is really organised — in bodies and now in machines — then it is also the layer where a human practitioner and an AI system could most meaningfully meet. Not at the level of abstract words (too coarse to carry movement quality) and not at the level of raw muscle signals (too low to be shared), but at the intermediate level of intent, where the felt shape of a movement lives.

This is also why the sensing technologies this series has tracked matter so much. Signals such as EMG (electromyography) record the electrical activity of muscles, including the anticipatory pre-activation that precedes visible movement. They are, in effect, windows onto the intent layer: they catch the body organising its movement in the middle zone, before the movement is complete and while it is still forming. An AI that could sense a practitioner's intent as it forms, and respond at that same level, would be meeting the mover where movement actually happens.
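One simple way to see that lead time in data is to find when each signal first rises clearly above its resting level. The sketch below uses a basic threshold rule (resting mean plus a multiple of the resting standard deviation), which is one common approach among several, not a method this article attributes to anyone. The two traces are synthetic, made up purely for illustration, and the function name onset_index, the 1000 Hz sampling rate and all parameter values are assumptions.

```python
from statistics import mean, stdev


def onset_index(samples, baseline_len, k=3.0, min_run=3):
    """Index where the signal first stays above baseline mean + k * SD.

    The first baseline_len samples are treated as rest. The signal must stay
    above threshold for min_run consecutive samples to count as an onset.
    Returns None if no onset is found.
    """
    baseline = samples[:baseline_len]
    threshold = mean(baseline) + k * stdev(baseline)
    run = 0
    for i in range(baseline_len, len(samples)):
        run = run + 1 if samples[i] > threshold else 0
        if run == min_run:
            return i - min_run + 1
    return None


# Synthetic traces (invented numbers, not measurements).
SAMPLE_RATE_HZ = 1000.0                       # assumed sampling rate
emg = [0.01, 0.02] * 50 + [0.5] * 100         # rectified EMG amplitude
angle = [0.0, 0.001] * 75 + [0.2] * 50        # joint angle in radians

emg_onset = onset_index(emg, baseline_len=20)
movement_onset = onset_index(angle, baseline_len=20)
lead_ms = (movement_onset - emg_onset) * 1000.0 / SAMPLE_RATE_HZ
print(f"muscle activity leads visible movement by {lead_ms:.0f} ms")
```

In this made-up example the muscle signal crosses its threshold before the joint angle does, and the difference between the two onsets is the window in which a system could, in principle, respond to a movement that is still forming.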

The word is too high. The muscle is too low. Movement lives in the middle — and that middle, it turns out, is exactly where somatic knowledge has always worked.


Further Reading

Li, B., et al. (2026). MIND: Multi-scale intent diffusion for text-driven physics-based humanoid control. arXiv:2605.26006. https://arxiv.org/abs/2605.26006

Merleau-Ponty, M. (1962). Phenomenology of perception (C. Smith, Trans.). Routledge. (Original work published 1945)