This Week in Motion AI: Intent as the Bridge, and the Hierarchy of Movement

Week of 23–29 June 2026 — physics-based control learns to speak intent, and generation learns to separate the what from the how


1. Intent as the Missing Middle: MIND

MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control. Li, B., Zhang, R., Liang, H., Zhang, J., Zhang, J., Chen, X., & Wang, J. (ShanghaiTech). arXiv:2605.26006. https://arxiv.org/abs/2605.26006

The problem: Getting a physically simulated body to execute a movement described in words ("walk forward and turn sharply left") has faced a stubborn gap. Two-stage methods first generate the movement kinematically, then have a physics controller track it, but the generated motion falls outside the distribution the controller was trained on. End-to-end methods learn to map text directly to low-level motor actions, but the gap between a sentence and a torque command is enormous, and learning across it is unstable.

The approach: MIND identifies the missing middle. Its key insight: the state of a body in motion (its posture, velocity, dynamics) is far more semantically aligned with a text description than the low-level actions (forces, torques) that produce it. So MIND introduces behavioural intent — a mid-level representation of what the body is doing — as a semantic bridge between language and control. A multi-scale diffusion framework generates intent at several temporal resolutions, and control follows from intent rather than leaping directly from text.
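To make the text, intent, action chain concrete, here is a minimal Python sketch. It is not MIND's architecture: the learned diffusion model is replaced by a hypothetical keyword lookup (`placeholder_intent`), intent is reduced to a single target forward velocity, and the body is a 1-D point mass. The function names, gains, and window sizes are all assumptions made for illustration. What it does show is the shape of the idea: intent lives in state space, can be viewed at several temporal resolutions, and the low-level force is derived from it rather than from the text.

```python
from typing import Dict, List, Sequence


def placeholder_intent(text: str, steps: int) -> List[float]:
    """Hypothetical stand-in for a learned text-to-intent model.

    Returns one target forward velocity (m/s) per control step.
    """
    speed = 1.5 if "run" in text else 1.0 if "walk" in text else 0.0
    return [speed] * steps


def coarsen(intent: Sequence[float], window: int) -> List[float]:
    """Average non-overlapping windows to view the intent at a coarser time scale."""
    chunks = [intent[i:i + window] for i in range(0, len(intent), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


def multi_scale(intent: Sequence[float], windows=(1, 4, 16)) -> Dict[int, List[float]]:
    """Same intent at several temporal resolutions, keyed by window size."""
    return {w: coarsen(intent, w) for w in windows}


def track(intent: Sequence[float], gain: float = 5.0,
          mass: float = 1.0, dt: float = 0.1) -> List[float]:
    """Proportional controller on a 1-D point mass: intent in, force out."""
    velocity, history = 0.0, []
    for target in intent:
        force = gain * (target - velocity)  # low-level action derived from intent
        velocity += force / mass * dt       # explicit Euler integration
        history.append(velocity)
    return history


intent = placeholder_intent("walk forward", steps=16)
scales = multi_scale(intent)
velocities = track(scales[1])  # the finest scale drives the controller
print(len(scales[4]), len(scales[16]), round(velocities[-1], 3))
```

In this toy the coarse scales carry no extra information because the intent is constant; in a real system the coarse track would hold phrase-level structure and the fine track the step-level detail.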

Why it matters for somatic AI: The "intent as bridge" structure echoes a somatic truth. Movement is not organised directly from abstract instruction to muscle command; it is organised through intention — the felt sense of what one is about to do, which then shapes the muscular execution. MIND's intermediate intent layer is a computational rhyme with motor intentionality (cf. the May 1 deep analysis on Merleau-Ponty). The system works better when it models the intending, not just the acting.


2. Scaling Language-Driven Physical Control: SCRIPT

SCRIPT: Scalable Diffusion Policy with Multi-Stage Training for Language-Driven Physics-Based Humanoid Control (2026). arXiv:2605.22894. https://arxiv.org/abs/2605.22894

The approach: SCRIPT jointly models actions, physical states, and language in a single architecture (JAST-DiT), adding history conditioning and reinforcement-learning post-training to learn closed-loop control from large-scale physically executable trajectories. Crucially, it shows consistent gains with model scaling — the language-to-physical-control problem responds to the same scale-up logic that has driven progress elsewhere.
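A toy sketch of what "history conditioning" and "closed-loop control" mean in practice, assuming a 1-D state and a hand-written policy. Nothing here reflects JAST-DiT itself: `toy_policy`, its gains, and the history length are hypothetical, and there is no learning or diffusion involved. The point is only the loop structure, in which each action depends on the instruction plus a window of recent states, and each resulting state is fed back in.

```python
from collections import deque
from typing import List, Sequence


def toy_policy(instruction: str, history: Sequence[float]) -> float:
    """Hypothetical policy: move toward a target named by the instruction,
    damped by the trend across the recent state history."""
    target = 1.0 if "forward" in instruction else 0.0
    current = history[-1]
    trend = history[-1] - history[0] if len(history) > 1 else 0.0
    return 0.5 * (target - current) - 0.1 * trend


def rollout(instruction: str, steps: int = 50, history_len: int = 4) -> List[float]:
    """Closed loop: act on recent states, observe the new state, repeat."""
    state = 0.0
    history = deque([state], maxlen=history_len)  # oldest states fall off the left
    states = []
    for _ in range(steps):
        action = toy_policy(instruction, history)
        state += action            # trivial stand-in for the physics step
        history.append(state)
        states.append(state)
    return states


states = rollout("walk forward")
print(round(states[0], 3), round(states[-1], 3))
```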

Why it matters: Alongside MIND, SCRIPT signals that physically grounded, language-driven movement control is becoming both more capable and more scalable. For somatic AI, physical grounding matters: a co-creative system whose generated movement respects physical law (weight, balance, momentum) is a system whose output a practitioner's body can meaningfully respond to. Physically implausible generation breaks the somatic dialogue.


3. Separating the What from the How: DC-Motion

DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation (2026). arXiv:2606.14721. https://arxiv.org/abs/2606.14721

The problem: Human movement is hierarchical — it has high-level semantic structure (what movement this is, its temporal layout) and fine-grained detail (joint smoothness, local dynamics, the micro-texture of execution). Models using a single uniform representation struggle: diffusion models handle the semantics but blur the detail; autoregressive models with quantised tokens capture structure but lose fine physical detail to quantisation error.

The approach: DC-Motion factorises movement into two token types. A Discrete-Continuous VAE encodes high-level semantics as discrete tokens (the compositional structure, the temporal layout) and fine-grained dynamics as continuous residuals (joint smoothness, local texture) — avoiding the irreversible information loss of pure quantisation while keeping the compositional benefits of discrete structure.
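The discrete-plus-residual idea can be illustrated with a scalar toy, sketched below in Python. This is not DC-Motion's VAE: the codebook is fixed by hand rather than learned, the "motion" is a short list of numbers, and the residual is stored exactly rather than encoded. All names and values are assumptions. It shows why keeping a continuous residual avoids the information loss of pure quantisation: the discrete tokens alone give a coarse reconstruction, while tokens plus residuals recover the signal.

```python
from typing import List, Optional, Tuple

# Hypothetical hand-picked codebook of coarse values (a learned one in practice).
CODEBOOK = [0.0, 0.5, 1.0]


def encode(signal: List[float]) -> Tuple[List[int], List[float]]:
    """Split each value into a discrete token (nearest codebook index)
    and a continuous residual (what quantisation would throw away)."""
    tokens, residuals = [], []
    for x in signal:
        idx = min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - x))
        tokens.append(idx)
        residuals.append(x - CODEBOOK[idx])
    return tokens, residuals


def decode(tokens: List[int], residuals: Optional[List[float]] = None) -> List[float]:
    """Reconstruct from tokens alone (coarse) or tokens plus residuals (full)."""
    if residuals is None:
        return [CODEBOOK[i] for i in tokens]
    return [CODEBOOK[i] + r for i, r in zip(tokens, residuals)]


signal = [0.1, 0.45, 0.62, 0.93]
tokens, residuals = encode(signal)
discrete_only = decode(tokens)
full = decode(tokens, residuals)

coarse_error = max(abs(a - b) for a, b in zip(signal, discrete_only))
full_error = max(abs(a - b) for a, b in zip(signal, full))
print(tokens, coarse_error > full_error)
```

The tokens stay compositional (they can be compared, counted, or predicted autoregressively), while the residual track carries the fine detail that the codebook cannot express.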

Why it matters for somatic AI: The discrete/continuous decomposition maps suggestively onto a somatic distinction: the nameable structure of a movement (its identifiable form, its phrasing) versus its felt texture (the continuous, hard-to-quantise quality of how it is executed). DC-Motion keeps the fine continuous residual precisely because that is where the quality lives — an architectural acknowledgement, in a mainstream generation paper, that the quantisable structure is not the whole of movement.


APA References

Li, B., Zhang, R., Liang, H., Zhang, J., Zhang, J., Chen, X., & Wang, J. (2026). MIND: Multi-scale intent diffusion for text-driven physics-based humanoid control. arXiv:2605.26006. https://arxiv.org/abs/2605.26006

SCRIPT: Scalable diffusion policy with multi-stage training for language-driven physics-based humanoid control. (2026). arXiv:2605.22894. https://arxiv.org/abs/2605.22894

DC-Motion: Decoupling semantics and details via discrete-continuous tokens for human motion generation. (2026). arXiv:2606.14721. https://arxiv.org/abs/2606.14721