This Week in Motion AI: In-betweening, Co-Speech Avatars, and Billion-Parameter Motion Models

Week of 12–18 May 2026 — three developments that push motion generation toward continuity, expression, and scale


1. Filling the Gaps: Diffusion Models Learn to In-between from Sparse Keyframes

Generative Motion In-betweening by Diffusion over Continuous Implicit Representations
Fan, S., Henderson, P., & Ho, E. S. L. (2026, May 12). arXiv:2605.12778. https://arxiv.org/abs/2605.12778

The problem: Motion in-betweening (generating the movement that connects two keyframe poses) is a foundational task for animation and interactive performance tools. The challenge is that keyframes are often sparse (a starting pose and an ending pose, with nothing in between) and ambiguous (many physically plausible paths connect any two poses). Most existing methods either fail to hit the keyframes accurately or sacrifice diversity by collapsing to a single most-likely solution.

The approach: A team at the University of Glasgow proposes encoding motion sequences as implicit neural representations (INRs), continuous functions that can represent a motion phrase at arbitrary temporal resolution, and training a latent diffusion model to sample the parameters of these INR functions conditioned on sparse, ambiguous keyframe constraints. Because the INR represents motion continuously rather than as a fixed sequence of frames, the diffusion model learns to fill the temporal gap in a physically coherent way without being locked to a discrete frame rate.
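To make the idea of a continuous motion representation concrete, here is a minimal sketch of an INR as a tiny network that maps a normalised time value to a pose. It is not the authors' architecture: the toy skeleton size, the hidden width, the sinusoidal activation, and the names init_inr_params and query_pose are all assumptions made for illustration, and the random weights simply stand in for parameters that a generative model would sample.

```python
import math
import random

NUM_JOINTS = 4   # hypothetical toy skeleton, not taken from the paper
HIDDEN = 16      # hypothetical hidden width


def init_inr_params(seed):
    """Random weights standing in for INR parameters a generative model would sample."""
    rng = random.Random(seed)
    return {
        "w1": [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)],
        "b1": [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)],
        "w2": [[rng.gauss(0.0, 0.1) for _ in range(NUM_JOINTS * 3)]
               for _ in range(HIDDEN)],
    }


def query_pose(params, t):
    """Evaluate the INR at one normalised time t in [0, 1].

    Returns NUM_JOINTS tuples of (x, y, z) coordinates.
    """
    # Sinusoidal hidden layer (an assumed activation choice).
    hidden = [math.sin(w * t + b) for w, b in zip(params["w1"], params["b1"])]
    # Linear output layer producing the flattened pose vector.
    flat = [sum(h * row[k] for h, row in zip(hidden, params["w2"]))
            for k in range(NUM_JOINTS * 3)]
    return [tuple(flat[3 * j:3 * j + 3]) for j in range(NUM_JOINTS)]


params = init_inr_params(seed=0)

# The same function can be sampled on any time grid.
coarse = [query_pose(params, i / 30) for i in range(31)]     # 31 samples
fine = [query_pose(params, i / 120) for i in range(121)]     # 121 samples
off_grid = query_pose(params, 0.4567)                        # arbitrary time point
```

The point of the sketch is that the motion lives in params, not in a frame list: the coarse and fine sequences are two readings of one function, so they agree wherever their time grids coincide.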

Results: The authors report that the method substantially outperforms existing approaches on motion quality in sparse-keyframe scenarios while maintaining diversity across generated samples. Notably, the INR representation means the generated motion can be queried at any time point, not just at training-frame intervals, which yields smooth transitions without visible artefacts even between widely separated keyframes.

Why it matters for somatic-AI practice: In-betweening is the computational analogue of what a choreographer does when they set two anchor moments in a phrase and trust the mover's body to find the path between them. A diffusion-based in-betweening system that generates plausible, diverse transitions from sparse constraints is directly useful for score-based co-creation: the practitioner provides the moments of arrival; the AI proposes the movement in between. The INR's continuous temporal representation is particularly suited to somatic work, where movement quality is determined by the arc of the phrase rather than the positions of individual frames.


2. Speaking with the Whole Body: Unified Sparse Motion for Real-Time Co-Speech Avatars

UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars
Zhan, X., Fu, X., Yang, C., et al. (2026, May 14). arXiv:2605.14731. https://arxiv.org/abs/2605.14731

The problem: Co-speech motion generation (producing the body gestures and movement that accompany speech) has been treated as a specialised task requiring separate pipelines for audio–motion alignment, facial expression, and body gesture. Systems that excel at one modality tend to compromise on others, and real-time performance (the requirement for live virtual avatar interaction) has been difficult to achieve alongside high motion quality.

The approach: UMo processes text, audio, and motion tokens within a single, unified sparse-motion formulation. By treating all three modalities as sequences of discrete tokens and training a transformer to model their joint distribution, the architecture achieves tight audio–motion alignment without separate alignment modules, while the sparse representation maintains the throughput needed for real-time generation.
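One generic way to place several modalities in a single token space is to give each modality its own range of IDs inside one shared vocabulary, then feed a transformer one combined sequence. The sketch below illustrates that idea only. The codebook sizes, the interleaving order, and the helper names are hypothetical; none of them is taken from the UMo paper, whose actual tokenisation may differ.

```python
# Hypothetical codebook sizes; none of these values come from the UMo paper.
VOCAB_SIZES = {"text": 1000, "audio": 512, "motion": 256}


def build_offsets(vocab_sizes):
    """Assign each modality a contiguous ID range in one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets, start


OFFSETS, TOTAL_VOCAB = build_offsets(VOCAB_SIZES)


def to_shared(modality, local_id):
    """Map a modality-local token ID to its shared-vocabulary ID."""
    if not 0 <= local_id < VOCAB_SIZES[modality]:
        raise ValueError(f"{modality} token {local_id} is out of range")
    return OFFSETS[modality] + local_id


def from_shared(shared_id):
    """Recover (modality, local_id) from a shared-vocabulary ID."""
    for name, size in VOCAB_SIZES.items():
        start = OFFSETS[name]
        if start <= shared_id < start + size:
            return name, shared_id - start
    raise ValueError(f"token {shared_id} is outside the shared vocabulary")


def build_sequence(text_ids, audio_ids, motion_ids):
    """Text prefix, then audio and motion tokens interleaved step by step.

    The interleaving order is an assumption chosen for illustration.
    """
    sequence = [to_shared("text", i) for i in text_ids]
    for audio_id, motion_id in zip(audio_ids, motion_ids):
        sequence.append(to_shared("audio", audio_id))
        sequence.append(to_shared("motion", motion_id))
    return sequence


sequence = build_sequence(text_ids=[5, 42], audio_ids=[7, 8], motion_ids=[100, 101])
decoded = [from_shared(token) for token in sequence]
```

Because every token carries its modality in its ID range, a single model can attend across text, audio, and motion without a separate alignment module, which is the property the paragraph above describes.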

Why it matters: The integration of text, audio, and movement into a single token space is a step toward the kind of multi-channel conditioning that somatic-AI interfaces require. A practitioner who speaks, moves, and vocalises simultaneously produces all three signals; a system that can condition on their joint distribution — rather than treating each as a separate input — is better positioned to respond to the practitioner as a whole expressive agent rather than as a set of independent data streams.


3. Motion at Billion-Parameter Scale: HY-Motion 1.0

Tencent HY-Motion 1.0 (HuggingFace release)
tencent/HY-Motion-1.0. https://huggingface.co/tencent/HY-Motion-1.0

What it is: Tencent's HY-Motion 1.0 is a series of text-to-3D human motion generation models based on a Diffusion Transformer (DiT) architecture and flow matching, scaled to the billion-parameter level. The models generate skeleton-based 3D character animations directly from text prompts. HY-Motion is the first open-source text-to-motion model family to operate at this parameter scale.
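For readers unfamiliar with flow matching, the toy sketch below shows its two halves: the regression target used in training and the step-by-step integration used in sampling. It assumes straight-line paths between noise and data, which is one common choice and not necessarily the one HY-Motion uses. The oracle_velocity function is a hypothetical stand-in for a trained, text-conditioned network, and the three-number "pose" is purely illustrative.

```python
import random


def training_example(noise, data, t):
    """Point on the straight path from noise to data, plus its velocity target.

    A network would be trained to predict target_v given (x_t, t).
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target_v = [d - n for n, d in zip(noise, data)]
    return x_t, target_v


def oracle_velocity(x, t, target):
    """Exact velocity field when the data consists of a single target pose.

    Stand-in for a trained model; valid only for t < 1.
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def sample_pose(target, num_steps=10, seed=0):
    """Integrate the velocity field from Gaussian noise with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t runs from 0 up to (num_steps - 1) / num_steps
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_pose = [0.0, 1.0, 0.5]          # flattened toy pose, illustrative only
generated = sample_pose(target_pose)    # ends very close to target_pose
x_t, target_v = training_example([0.0, 0.0], [2.0, 4.0], t=0.5)
```

In a real system the oracle is replaced by a learned network, so sampling produces a distribution of motions consistent with the prompt rather than a single fixed pose.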

Significance of scale: The shift to billion-parameter motion models mirrors the trajectory that language and image generation followed: larger models trained on broader data produce better instruction-following, more expressive motion vocabularies, and stronger generalisation to unusual prompts. HY-Motion's public release makes billion-parameter motion generation accessible to the research community for the first time — lowering the barrier for fine-tuning on specialist datasets such as somatic practice recordings or Contact Improvisation archives.

Practical note: The model generates skeleton sequences (joint positions over time), not video. Downstream rendering into visual output requires a separate mesh deformation or video synthesis step — but the skeleton representation is directly compatible with pose-based analysis pipelines and motion capture workflows.
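As an illustration of why joint positions over time are convenient for analysis, the sketch below computes per-joint speed from a skeleton sequence by finite differences. The data layout (a list of frames, each a list of (x, y, z) joint tuples), the frame rate, and the function name joint_speeds are assumptions for this example; the model's actual output format should be checked against its repository.

```python
import math


def joint_speeds(frames, fps):
    """Per-joint speed between consecutive frames, in position units per second.

    frames: list of T frames, each a list of J (x, y, z) tuples (assumed layout).
    Returns a list of T - 1 rows with J speeds each.
    """
    speeds = []
    for previous, current in zip(frames, frames[1:]):
        # Distance travelled in one frame, scaled by the frame rate.
        speeds.append([math.dist(p, c) * fps for p, c in zip(previous, current)])
    return speeds


# Toy two-joint sequence: joint 0 accelerates along x, joint 1 stays still.
frames = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.1, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.3, 0.0, 0.0), (0.0, 1.0, 0.0)],
]
speeds = joint_speeds(frames, fps=30)
```

The same pattern extends to acceleration or jerk by differencing again, which is a typical starting point for describing movement quality from skeleton data.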


APA References

Fan, S., Henderson, P., & Ho, E. S. L. (2026). Generative motion in-betweening by diffusion over continuous implicit representations. arXiv:2605.12778. https://arxiv.org/abs/2605.12778

Tencent. (2026). HY-Motion-1.0 [Model repository]. HuggingFace. https://huggingface.co/tencent/HY-Motion-1.0

Zhan, X., Fu, X., Yang, C., et al. (2026). UMo: Unified sparse motion modeling for real-time co-speech avatars. arXiv:2605.14731. https://arxiv.org/abs/2605.14731