Three significant developments from the week of 28 April – 4 May 2026 in motion manifold learning, generative video, and motion-conditioned generation.


1. Physics-Supervised Conditioning for Generative Motion Video

PhyCo: Learning Controllable Physical Priors for Generative Motion

Narayanan, S., Jiang, Z., Narasimhan, S., & Chandraker, M. (2026, April 30). PhyCo: Learning Controllable Physical Priors for Generative Motion. arXiv. https://arxiv.org/abs/2604.28169

Standard video generation models produce visually plausible motion by learning the statistical distribution of natural video. They have no explicit representation of physical properties such as friction, mass, and restitution, the parameters that causally govern how bodies and objects move. PhyCo addresses this by fine-tuning a diffusion-based video generation model on over 100,000 photorealistic simulation videos in which these physical parameters are systematically varied. The result is a model in which friction, elasticity, and applied force become continuous, interpretable conditioning variables: a mover can specify not just "walk slowly" but also the physical character of the walking surface and the gravitational context.
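To make the idea of continuous physical conditioning variables concrete, here is a minimal sketch of one common way to turn scalar parameters into a conditioning vector: normalise each value to a fixed range and expand it with sinusoidal features. This is not PhyCo's published implementation. The parameter names, the ranges in PARAM_RANGES, the embedding size, and the functions embed_scalar and physics_condition are all assumptions made for illustration.

```python
import math

# Hypothetical parameter ranges, chosen only for illustration.
PARAM_RANGES = {
    "friction": (0.0, 1.0),
    "restitution": (0.0, 1.0),
    "force": (0.0, 50.0),
}


def embed_scalar(value, low, high, dim=8):
    """Normalise value to [0, 1] and return a sinusoidal embedding of length dim."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    t = (value - low) / (high - low)
    t = min(max(t, 0.0), 1.0)  # clamp out-of-range inputs
    features = []
    for k in range(dim // 2):
        freq = math.pi * (2 ** k)  # geometrically spaced frequencies
        features.append(math.sin(freq * t))
        features.append(math.cos(freq * t))
    return features


def physics_condition(params, dim=8):
    """Concatenate one embedding per physical parameter, in PARAM_RANGES order."""
    vector = []
    for name, (low, high) in PARAM_RANGES.items():
        vector.extend(embed_scalar(params[name], low, high, dim))
    return vector


cond = physics_condition({"friction": 0.2, "restitution": 0.8, "force": 10.0})
print(len(cond))  # 24
```

Because each parameter occupies its own slice of the vector, changing friction alone changes only the first eight entries, which is one simple way to keep the variables separately interpretable. How such a vector would be injected into the video model is left open here.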

Field relevance: This is a significant step toward physically grounded generative video. For somatic AI co-creation pipelines, physically plausible motion is necessary but not sufficient. The next question is whether physically consistent generation can be conditioned on the qualities of the mover's own felt body rather than on simulated physics parameters. PhyCo establishes the conditioning architecture; the open problem is what the conditioning variables should be for somatic-quality generation. Source tier: Tier 3 (arXiv preprint, submitted April 30, 2026).


2. Predicting Video Rather Than Completing It

Video Generation with Predictive Latents

Zhao, Y., et al. (2026, May 4). Video Generation with Predictive Latents. arXiv. https://arxiv.org/abs/2605.02134

Most video generation models are trained on a reconstruction objective: given some frames, reproduce them as accurately as possible. This makes the model a very good compressor of what it has seen. The paper proposes a different objective: the decoder must not only reconstruct the observed frames but also predict future frames that it has not been shown. This predictive reconstruction objective pushes the model's latent representations to encode the causal structure of the sequence, not just its appearance. The reported result is improved temporal consistency in generated sequences and better generalisation to motion dynamics not seen during training.
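The difference between a pure reconstruction objective and a predictive one can be sketched as a loss with two terms: one over the frames the encoder saw, and one over future frames it did not see. The sketch below is an assumption about the general shape of such an objective, not the paper's actual loss. The function names, the use of mean squared error, and the future_weight parameter are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def predictive_reconstruction_loss(decoded, target, n_observed, future_weight=1.0):
    """Return (total, reconstruction, prediction) losses.

    decoded, target: lists of frames, each frame a flat list of floats.
    The first n_observed frames were visible to the encoder; the remaining
    frames are future frames the decoder has to predict.
    """
    if len(decoded) != len(target):
        raise ValueError("decoded and target must contain the same number of frames")
    if not 0 < n_observed < len(target):
        raise ValueError("need at least one observed and one future frame")
    per_frame = [mse(d, t) for d, t in zip(decoded, target)]
    recon = sum(per_frame[:n_observed]) / n_observed
    pred = sum(per_frame[n_observed:]) / (len(target) - n_observed)
    return recon + future_weight * pred, recon, pred


# Toy example: the decoder is perfect on seen frames but wrong on the last future frame.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
decoded = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
total, recon, pred = predictive_reconstruction_loss(decoded, target, n_observed=2)
print(total, recon, pred)  # 0.5 0.0 0.5
```

In the toy example a reconstruction-only objective would report zero loss, while the predictive term still penalises the wrong future frame. That extra pressure is what is meant to push the latents toward encoding dynamics rather than appearance alone.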

Field relevance: The core claim — that predictive latents encode causal structure rather than just appearance — is directly relevant to the problem of motion-conditioned generation. A world model that has learned causal motion structure can be conditioned on partial body state (e.g., the beginning of a movement phrase) and extrapolate the rest in a physically and kinetically coherent way. This is architecturally closer to the kind of forward-model capacity a real-time somatic-to-visual pipeline needs. Source tier: Tier 3 (arXiv preprint, submitted May 4, 2026).


3. Physics-Aware Multi-Body Interaction Synthesis

InterPhys: Physics-Aware Human Motion Synthesis in a Dynamic Scene

Xing, C., Mao, W., & Liu, M. (2026, May 1). InterPhys: Physics-aware human motion synthesis in a dynamic scene. arXiv. https://arxiv.org/abs/2605.01036

Generating realistic human movement in interaction with objects or other people requires more than kinematic plausibility — it requires modelling the forces being exchanged at contact points. InterPhys explicitly represents human-object and human-scene contact forces in its synthesis pipeline, generating motions where the body's dynamics respond physically to what it is touching. The system is trained and evaluated on environments with multiple interacting agents and dynamic obstacles.
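To illustrate what it means to treat contact as a force rather than a positional constraint, the sketch below uses a standard penalty (spring-damper) contact model against a ground plane. This is a generic textbook technique swapped in for illustration; the contact model InterPhys actually uses is not described here. The stiffness, damping, mass, and time step values, and the function names, are assumptions.

```python
def contact_force(z, vz, stiffness=1000.0, damping=20.0):
    """Upward normal force from a ground plane at z = 0.

    z: height of the contact point (negative means penetration).
    vz: vertical velocity (positive is up).
    The force is clamped at zero so the ground can push but never pull.
    """
    depth = -z
    if depth <= 0.0:
        return 0.0  # no contact, no force
    return max(0.0, stiffness * depth - damping * vz)


def simulate_drop(z0=0.5, mass=1.0, gravity=9.81, dt=0.001, steps=5000):
    """Drop a point mass onto the ground using semi-implicit Euler integration."""
    z, vz = z0, 0.0
    for _ in range(steps):
        net_force = contact_force(z, vz) - mass * gravity
        vz += dt * net_force / mass  # update velocity first
        z += dt * vz                 # then position, using the new velocity
    return z, vz


z_final, vz_final = simulate_drop()
# At rest the spring force balances weight, so the penetration is mass * gravity / stiffness.
print(round(z_final, 4))  # -0.0098
```

A positional constraint would simply clamp z to zero. Here the body settles where the contact force balances its weight, and the size of that force is available at every step as a quantity a synthesis model could reason about.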

Field relevance: Contact-aware synthesis is the unsolved core problem for duo and ensemble dance generation, and for somatic practices (Contact Improvisation, martial arts partnering) where the quality of physical exchange is the primary expressive material. Most current motion generation models treat contact as a positional constraint; InterPhys treats it as a force exchange, which is the correct physical framing. Source tier: Tier 3 (arXiv preprint, submitted May 1, 2026).