This Week in Motion AI: Retargeting Bodies, Organising Motion, and Unifying Perception with Generation

Week of 19–25 May 2026 — datasets reach new scale, retargeting gets geometry-aware, and perception meets generation in one model


1. Motion That Fits: Geometry-Aware Retargeting Across Disparate Body Shapes

Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance. KAIST Visual Media Lab. (2026, May 19). arXiv:2605.19355. https://arxiv.org/abs/2605.19355

The problem: Motion retargeting, the transfer of a movement sequence from one body to another, is a core operation in animation, robotics, and somatic AI. The challenge is that movement is shaped by the body that produces it: a phrase that is natural for a tall, long-limbed body may be biomechanically strained on a shorter, differently proportioned one. Geometry-aware retargeting methods that try to preserve spatial relationships between body regions struggle when target characters have exaggerated or significantly different proportions, because the mapping between corresponding regions on the source and target bodies breaks down.

The approach: A team at KAIST's Visual Media Lab proposes spatially adaptive interaction guidance — a Transformer-based framework that identifies proximity anchors (points on the body that interact with each other or with the environment) and predicts anchor displacements specific to the target body's geometry. Displaced anchors are then projected back onto the target character's surface via differentiable soft projection, preserving the interaction semantics of the original motion (self-contact, near-body proximity, object contact) without requiring fixed correspondence maps.
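To make the projection step concrete, here is a minimal sketch of one common way to build a differentiable "soft" projection: a softmax-weighted average of surface sample points, weighted by distance to the displaced anchor. This is an illustration under stated assumptions, not the paper's implementation. The function names (soft_project, retarget_anchor), the temperature parameter, and the choice of softmax weighting are all hypothetical, and the displacement that the paper's Transformer would predict is simply passed in as an argument.

```python
import math

def soft_project(point, surface_points, temperature=0.05):
    """Softly project a 3D point onto a set of surface sample points.

    Returns a convex combination of surface_points, weighted by a softmax
    over negative squared distances. A smaller temperature pulls the result
    toward the single nearest surface point; a larger one blends more points.
    (Hypothetical helper, assumed for illustration.)
    """
    sq_dists = [
        sum((p - s) ** 2 for p, s in zip(point, sp)) for sp in surface_points
    ]
    # Subtract the smallest distance before exponentiating for numerical stability.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / temperature) for d in sq_dists]
    total = sum(weights)
    return tuple(
        sum(w * sp[axis] for w, sp in zip(weights, surface_points)) / total
        for axis in range(3)
    )

def retarget_anchor(anchor, displacement, target_surface, temperature=0.05):
    """Apply a displacement to an anchor, then project it onto the target surface.

    In the paper the displacement is predicted by the model; here it is an input.
    """
    displaced = tuple(a + d for a, d in zip(anchor, displacement))
    return soft_project(displaced, target_surface, temperature)

# Usage: three sample points standing in for a target character's surface.
target_surface = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
projected = retarget_anchor((0.9, 0.1, 0.3), (0.0, 0.0, -0.2), target_surface)
print(projected)  # lands very close to the nearest sample point, (1, 0, 0)
```

Because every step is a smooth function of the anchor position, gradients can flow through the projection during training, which is the property the word "differentiable" refers to. A hard nearest-point lookup would not offer that.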

Why it matters: The result is motion retargeting that respects what a movement is doing — its functional and expressive relationships — rather than just mapping joint positions. For somatic AI, this is directly relevant to the identity-aware generation problem: movement cannot simply be mapped from a training distribution of bodies to a specific practitioner's body using geometric transformations. The functional and expressive content of the movement must be preserved through the transfer. Spatially adaptive guidance is a step toward retargeting that is semantically faithful, not just positionally close.


2. A Taxonomy for the Full Range of Human Movement

RoMo: A Large-Scale, Richly Organised Dataset and Semantic Taxonomy for Human Motion Generation. Researchers from ANU, Roblox, Stanford University, and Rutgers University. (2026, May 25). arXiv:2605.26241. https://arxiv.org/abs/2605.26241

The problem: Text-to-motion generation has been limited by its training data. Existing 3D human motion datasets offer either high fidelity (motion capture, small scale, lab conditions) or large scale (in-the-wild video, lower quality, dominated by static or low-movement sequences). Evaluation compounds the problem: global metrics obscure where models are strong and where they fail. A model that excels at locomotion may be poor at fine manipulation, yet its overall Fréchet Inception Distance (FID) score may still be competitive.

The approach: RoMo introduces a three-level semantic taxonomy that organises human movement from broad categories (locomotion, manipulation, expressive gesture) down to fine-grained subcategories, paired with a taxonomy-aware filtering pipeline that removes static and artefact-prone sequences. Every sequence in the dataset receives detailed captions and a taxonomic label. The hierarchical structure enables per-category evaluation, revealing model strengths and weaknesses that global metrics hide.
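The three ideas in this paragraph (a three-level label, a filter for static sequences, and per-category evaluation) can be sketched in a few lines. The sketch below is a toy illustration, not RoMo's pipeline: the mean-joint-speed criterion, the min_speed threshold, the record layout, and the subcategory names ("walking", "grasping", and so on) are assumptions made for the example. Only the top-level category names come from the description above.

```python
import math
from collections import defaultdict

def mean_joint_speed(frames):
    """Average distance each joint moves between consecutive frames.

    frames: list of frames, each a list of (x, y, z) joint positions.
    """
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for p, c in zip(prev, curr):
            total += math.dist(p, c)
            count += 1
    return total / count if count else 0.0

def filter_static(sequences, min_speed=0.01):
    """Drop sequences whose joints barely move (assumed filtering criterion)."""
    return [s for s in sequences if mean_joint_speed(s["frames"]) >= min_speed]

def per_category_mean(records, level):
    """Average a per-sequence score, grouped by taxonomy prefix at `level` (1 to 3)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["label"][:level]].append(r["score"])
    return {label: sum(v) / len(v) for label, v in buckets.items()}

# Usage: hypothetical per-sequence scores (higher is better) with three-level labels.
records = [
    {"label": ("locomotion", "walking", "uphill"), "score": 1.0},
    {"label": ("locomotion", "running", "sprint"), "score": 0.5},
    {"label": ("manipulation", "grasping", "small_object"), "score": 0.25},
]
print(per_category_mean(records, level=1))
print(per_category_mean(records, level=2))
```

Grouping at level 1 separates a strong locomotion average from a weak manipulation one, a gap that a single pooled score across all three records would blur.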

Results: Models trained on RoMo achieve state-of-the-art fidelity and diversity on standard benchmarks, with substantially improved handling of complex, subtle text prompts — particularly in categories underrepresented in prior datasets.

Why it matters for somatic AI: The taxonomy problem in motion data is the same problem that somatic practitioners have always faced: how do you categorise and describe movement in a way that captures qualitative distinctions rather than just surface appearances? RoMo's three-level hierarchy is a step toward a data infrastructure that could support training models on qualitatively distinct movement categories — including the subtle, effort-quality-driven distinctions that somatic practice works with.


3. Perception Meets Generation: Superman Unifies Skeleton and Vision

Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation. Wang, X., Li, P., Wang, Z., et al. (Peking University, Sun Yat-sen University, Sony R&D, NTU). (2026, February). arXiv:2602.02401. https://arxiv.org/abs/2602.02401. Accepted at CVPR 2026.

The problem: The field of human motion AI has historically been fragmented between two model families. Perception models (pose estimation, motion prediction from video) process visual input but output descriptors or text, not movement sequences. Generation models produce movement sequences but cannot process raw visual input — they require pre-processed skeleton data or text prompts. The two families share the same subject matter but cannot communicate: a perception model that observes motion cannot directly hand off to a generation model that continues it.

The approach: Superman introduces a Vision-Guided Motion Tokenizer that builds a unified cross-modal vocabulary grounded in the geometric alignment between 3D skeletons and visual data. By tokenising both visual observations and skeletal motion sequences into the same vocabulary, a single transformer model can handle 3D pose estimation, motion prediction, and motion in-betweening — switching fluidly between perception and generation tasks within a unified architecture.
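One way to picture a shared vocabulary is plain vector quantization: embeddings from either modality are mapped to the index of the nearest entry in a single codebook, so a downstream transformer sees the same token IDs regardless of where the input came from. The sketch below shows only that generic idea. It is not Superman's Vision-Guided Motion Tokenizer, and the codebook, the 2D embeddings, and the function names are hypothetical stand-ins for what trained encoders would produce.

```python
def quantize(embedding, codebook):
    """Return the index of the nearest codebook vector (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((e - c) ** 2 for e, c in zip(embedding, code))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index

def tokenize(embeddings, codebook):
    """Map a sequence of embeddings to a sequence of discrete token IDs."""
    return [quantize(e, codebook) for e in embeddings]

# Usage: one shared codebook, two modalities (all values are made up).
shared_codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
skeleton_embeddings = [(0.9, 0.1), (0.1, 0.8)]   # assumed output of a skeleton encoder
visual_embeddings = [(1.1, -0.1), (0.0, 1.2)]    # assumed output of a vision encoder

skeleton_tokens = tokenize(skeleton_embeddings, shared_codebook)
visual_tokens = tokenize(visual_embeddings, shared_codebook)
print(skeleton_tokens, visual_tokens)
```

In this toy case the two modalities produce the same token sequence because their embeddings fall near the same codebook entries. In a real system that alignment has to be learned, which is the part the paper attributes to grounding the vocabulary in the geometric relationship between 3D skeletons and visual data.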

Results: Superman achieves state-of-the-art performance across all three tasks and will be presented at CVPR 2026 (Denver, June 2026). It is among the first models to demonstrate that the perception-generation boundary is not architecturally necessary.

Why it matters: The somatic practitioner-AI dialogue that SSIN research envisions requires exactly this kind of boundary removal: the AI system must observe the practitioner's movement (perception), build an internal model of its qualities and dynamics, and generate a responsive continuation (generation) — in one coherent process. Superman's unified architecture is a proof of concept that this is computationally feasible.


APA References

KAIST Visual Media Lab. (2026). Skinned motion retargeting with spatially adaptive interaction guidance. arXiv:2605.19355. https://arxiv.org/abs/2605.19355

Wang, X., Li, P., Wang, Z., Fang, Z., Deng, Z., Wu, S., Li, J., & Liu, M. (2026). Superman: Unifying skeleton and vision for human motion perception and generation. arXiv:2602.02401. https://arxiv.org/abs/2602.02401

[RoMo authors]. (2026). RoMo: A large-scale, richly organised dataset and semantic taxonomy for human motion generation. arXiv:2605.26241. https://arxiv.org/abs/2605.26241