Scout Report: April 7–13, 2026

Motion Manifold Learning · Generative Video · Motion-Conditioned Generation


1. Motion Manifold Learning

CMP: Robust Whole-Body Tracking for Loco-Manipulation via Competence Manifold Projection arXiv:2604.07457 — Submitted April 8, 2026

This paper introduces CMP, a framework for robust whole-body control of legged mobile manipulators that tackles failures caused by out-of-distribution commands through three interlocking mechanisms. First, a frame-wise safety check converts arbitrary commanded poses into single-step manifold membership queries. Second, a safety estimator classifies whether an incoming command lies within the system's "competence manifold", the learned boundary of feasibly executable motion. Third, the authors construct an isomorphic latent space whose geometric structure is explicitly aligned with the safety-probability field, so that proximity in latent space corresponds to proximity in achievability. The system is validated on physical hardware and achieves up to a 10× improvement in survival rate against out-of-distribution commands while maintaining tracking fidelity on in-distribution tasks. Although the primary application domain is robotic loco-manipulation, the paper's core contribution, a geometry-preserving latent space that encodes the structural feasibility of motion, is a methodological contribution to motion manifold learning that may transfer to other embodied agents.
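The check-then-project idea described above can be illustrated with a toy sketch. This is not the authors' implementation: `safety_probability` is a hypothetical stand-in for the learned safety estimator (here a logistic function of distance from the origin), and `project_to_competence` assumes a known-safe anchor point and a safety probability that decreases monotonically along the segment from that anchor to the command.

```python
import numpy as np

def safety_probability(z, radius=1.0, sharpness=8.0):
    """Hypothetical safety estimator: close to 1 inside a ball, close to 0 outside."""
    dist = np.linalg.norm(z)
    return 1.0 / (1.0 + np.exp(sharpness * (dist - radius)))

def project_to_competence(z, threshold=0.9, anchor=None, steps=50):
    """Return z if it passes the membership check; otherwise pull it toward
    a safe anchor until it passes (bisection along the anchor-to-z segment)."""
    anchor = np.zeros_like(z) if anchor is None else anchor
    if safety_probability(z) >= threshold:
        return z  # single-step membership query: command is already feasible
    lo, hi = 0.0, 1.0  # interpolation weights: lo is known safe, hi is known unsafe
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        candidate = anchor + mid * (z - anchor)
        if safety_probability(candidate) >= threshold:
            lo = mid
        else:
            hi = mid
    return anchor + lo * (z - anchor)

# Usage: an in-distribution command passes through, an extreme one is pulled back.
in_dist = np.array([0.2, 0.1])
out_of_dist = np.array([3.0, 4.0])
print(project_to_competence(in_dist))
print(project_to_competence(out_of_dist))
```

In this toy setting the projected command keeps the direction of the original request and only shrinks its magnitude, which mirrors the stated goal of staying close to the commanded motion while remaining feasible.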

Field relevance: Introduces a feasibility-aware latent manifold whose geometry tracks safety probability; bears on the question of which motions a given body can actually execute, a problem central to motion representation research.

Source tier: Tier 3 (arXiv preprint, cs.RO)

Reference: Cheng, Z., Wei, H., Yin, H., Xu, X., Yu, B., Zhou, J., & Lu, J. (2026). CMP: Robust whole-body tracking for loco-manipulation via competence manifold projection. arXiv. https://arxiv.org/abs/2604.07457


2. Generative Video

Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation arXiv:2604.10030 — Submitted April 11, 2026

Video diffusion models struggle with semantic entanglement when a single generation must express a sequence of distinct events: concepts from one temporal segment bleed into adjacent ones, producing muddy transitions and misaligned timing. Prompt Relay addresses this without architectural modification or retraining by operating on cross-attention at inference time. The method introduces a temporal penalty that constrains each time segment's attention map to concentrate on its designated text prompt, actively suppressing cross-segment leakage. As a result, each semantic event occupies its intended temporal window while transitions remain visually coherent. Authors Chen, Huang, and Liu (Nanyang Technological University) demonstrate improvements on multi-event benchmarks while preserving single-event generation quality. Because the mechanism is post-hoc and training-free, it should apply as a drop-in inference-time wrapper to video diffusion models that condition on text through cross-attention.
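A minimal sketch of a temporal penalty on cross-attention is given here for intuition only. The exact form of the penalty is not stated in this report, so the linear distance-based penalty, the function names `temporal_penalty` and `penalized_attention`, and the inclusive `(start, end)` frame windows are all assumptions made for illustration.

```python
import numpy as np

def temporal_penalty(num_frames, token_segments, segment_windows, strength=5.0):
    """penalty[t, n] grows with how far frame t lies outside the frame window
    of the segment that text token n belongs to (assumed linear form)."""
    frames = np.arange(num_frames)[:, None]                                     # (T, 1)
    starts = np.array([segment_windows[s][0] for s in token_segments])[None, :]  # (1, N)
    ends = np.array([segment_windows[s][1] for s in token_segments])[None, :]    # (1, N)
    outside = np.maximum(starts - frames, 0) + np.maximum(frames - ends, 0)
    return strength * outside

def penalized_attention(logits, penalty):
    """Softmax over text tokens after subtracting the penalty from the logits."""
    shifted = logits - penalty
    shifted = shifted - shifted.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

# Usage: 8 frames, two events of 4 frames each, two text tokens per event.
windows = {0: (0, 3), 1: (4, 7)}   # inclusive frame ranges per segment
token_segments = [0, 0, 1, 1]      # segment index of each text token
logits = np.zeros((8, 4))          # uniform raw attention, for illustration
attn = penalized_attention(logits, temporal_penalty(8, token_segments, windows))
print(attn.round(3))
```

With uniform raw logits, each frame's attention mass ends up concentrated on the tokens of its own segment, and frames near a boundary keep a small amount of mass on the neighbouring segment because the assumed penalty grows gradually with distance.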

Field relevance: Addresses temporal semantic entanglement in video diffusion, a bottleneck for applications that require structured multi-phase motion sequences in generated video.

Source tier: Tier 3 (arXiv preprint, cs.CV)

Reference: Chen, G., Huang, Z., & Liu, Z. (2026). Prompt relay: Inference-time temporal control for multi-event video generation. arXiv. https://arxiv.org/abs/2604.10030


3. Motion-Conditioned Generation

CDAMD: Coordinate-Based Dual-Constrained Autoregressive Motion Generation arXiv:2604.08088 — Submitted April 9, 2026

CDAMD directly addresses two known failure modes in text-to-motion generation: error amplification during diffusion noise prediction, and mode collapse from discrete motion tokenisation in autoregressive models. The framework takes raw motion coordinates as input (bypassing rotation-space representations entirely, which sidesteps gimbal lock and representation ambiguity) and operates autoregressively while embedding diffusion-inspired multi-layer perceptrons inside each generation step to sharpen fidelity. The central innovation is the Dual-Constrained Causal Mask (DCCM): motion tokens from preceding frames act as kinematic priors that are concatenated with textual encodings, so each prediction is simultaneously constrained by (1) semantic intent from language and (2) physical continuity from motion history. Authors Ding, Wang, Gui, and Wang report state-of-the-art results on HumanML3D benchmarks for both generation fidelity and semantic consistency. The motion-editing experiments further demonstrate controllable post-hoc manipulation of generated sequences.
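One way to picture the dual constraint is as an attention mask over a sequence of text tokens followed by motion tokens. The sketch below is an assumption-laden illustration, not the paper's DCCM: it assumes text tokens form a fully visible prefix and motion tokens attend causally to earlier motion frames, and the function name `dual_constrained_mask` is hypothetical.

```python
import numpy as np

def dual_constrained_mask(num_text, num_motion):
    """Boolean mask where mask[i, j] is True if position i may attend to position j.
    Positions 0..num_text-1 are text tokens; the rest are motion tokens."""
    total = num_text + num_motion
    mask = np.zeros((total, total), dtype=bool)
    # Semantic constraint: every position can see the full text prompt.
    mask[:, :num_text] = True
    # Kinematic constraint: a motion token sees itself and earlier motion tokens only.
    mask[num_text:, num_text:] = np.tril(np.ones((num_motion, num_motion), dtype=bool))
    return mask

# Usage: 3 text tokens followed by 4 motion frames.
print(dual_constrained_mask(3, 4).astype(int))
```

Under these assumptions, each motion prediction is conditioned on the whole prompt and on its own motion history, while text tokens never see motion tokens and no motion token sees a future frame.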

Field relevance: The coordinate-based input with dual semantic-kinematic constraints is a meaningful architectural step toward motion generators that honour text intent and physical plausibility at the same time.

Source tier: Tier 3 (arXiv preprint, cs.CV)

Reference: Ding, K., Wang, H., Gui, J., & Wang, L. (2026). Coordinate-based dual-constrained autoregressive motion generation. arXiv. https://arxiv.org/abs/2604.08088


Verification note: All three papers were confirmed by fetching their arXiv abstract pages directly and checking the exact UTC submission timestamps. No items in this report are flagged 🚩. No peer-reviewed (Tier 1–2) papers with submission or acceptance dates in this window were identified; the three items above are all Tier 3 preprints.