I have verified three papers with confirmed submission dates in the March 24–30, 2026 window. Here is the full report.


Frontier Scout Report — Week of 2026-03-24 to 2026-03-30

Development 1 — Motion-Conditioned Generation

InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

Summary. Pan et al. introduce a framework for synthesizing naturalistic two-person interaction videos from speech audio, addressing a persistent gap in motion-conditioned video generation: the conditioning signal must encode not just one body's motion but the relational dynamics between two agents. The system comprises three novel components: an Interactivity Injector that performs identity-agnostic video reenactment from reference motion patterns; MetaQuery Alignment, a multimodal language model bridge that determines when and how reactive motion should occur given the audio context; and Role-aware Dyadic Gaussian Guidance (RoDG), which enforces lip synchronization and spatial consistency during extreme head poses. A new evaluation suite with dyadic-specific metrics accompanies the method. The method significantly outperforms prior speech-to-video baselines on both motion naturalness and interactional appropriateness.
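
To make the bridging idea concrete, here is a minimal sketch of an audio-to-guidance query module in the spirit of MetaQuery Alignment. Everything here is an illustrative assumption rather than the paper's actual architecture: the class name MetaQueryBridge, the learned-query design, the shapes, and the single cross-attention fusion.

```python
# Illustrative sketch only: learned queries attend over audio context to
# produce guidance tokens. Names and shapes are assumptions, not the
# paper's published architecture.
import torch
import torch.nn as nn

class MetaQueryBridge(nn.Module):
    """Maps speech features to a small set of guidance tokens that a
    downstream video generator can consume via cross-attention."""
    def __init__(self, d_audio=512, d_model=768, n_queries=16, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio_feats):            # (B, T_audio, d_audio)
        ctx = self.audio_proj(audio_feats)     # (B, T_audio, d_model)
        q = self.queries.expand(ctx.size(0), -1, -1)
        guidance, _ = self.attn(q, ctx, ctx)   # queries attend to the audio
        return guidance                        # (B, n_queries, d_model)
```

In a full pipeline, guidance tokens like these would be injected per speaker role alongside the reference-motion features from the reenactment stage; the paper's MLLM-based bridge is presumably richer than this single attention layer.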

Field relevance. Directly advances motion-conditioned video generation by treating relational motion as a first-class conditioning signal. The MetaQuery bridging strategy is a meaningful departure from single-body skeleton conditioning, with implications for any generative system that must model co-regulated, multi-agent body behavior.

Source tier. Tier 3 — arXiv preprint

Reference. Pan, D., Guo, L., Guan, J., Huang, L., Li, Y., Liu, H., Feng, H., He, W., Wang, K., & Zhou, H. (2026). InterDyad: Interactive dyadic speech-to-video generation by querying intermediate visual guidance. arXiv. https://arxiv.org/abs/2603.23132


Development 2 — Generative Video

PhysVid: Physics Aware Local Conditioning for Generative Video Models

Summary. Pathak et al. address a systemic failure mode in contemporary video diffusion models: despite impressive visual fidelity, they routinely violate physical laws — objects pass through surfaces, fluids defy gravity, rigid bodies spontaneously deform. PhysVid introduces physics-aware local conditioning that annotates each temporal chunk of a video with natural-language descriptions of the physical states and interactions present (e.g., contact forces, material behavior, trajectory arcs). These descriptions are fused into the generation process via chunk-aware cross-attention, which integrates them alongside the global text prompt during training. At inference, negative physics prompts are deployed to steer the diffusion trajectory away from physically implausible configurations. The approach yields an approximately 33% improvement on the VideoPhy physical plausibility benchmark and gains of up to 8% on VideoPhy2, without requiring explicit physics simulation or 3D scene graphs.
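
Both mechanisms lend themselves to a compact sketch under assumed names and shapes; neither ChunkAwareCrossAttention nor negative_physics_guidance is the paper's actual API. The first fuses a chunk's physics caption with the global prompt via cross-attention; the second applies a classifier-free-guidance-style update away from an embedded negative physics description at inference.

```python
# Illustrative sketch only; interfaces and the exact guidance form are
# assumptions, not PhysVid's published implementation.
import torch
import torch.nn as nn

class ChunkAwareCrossAttention(nn.Module):
    """Latent frames of one temporal chunk attend jointly to the global
    prompt embedding and that chunk's physics caption embedding."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, chunk_latents, global_txt, chunk_phys_txt):
        # chunk_latents: (B, T_chunk, d); global_txt: (B, L_g, d);
        # chunk_phys_txt: (B, L_p, d), the caption for this chunk only.
        kv = torch.cat([global_txt, chunk_phys_txt], dim=1)
        out, _ = self.attn(chunk_latents, kv, kv)
        return chunk_latents + out  # residual fusion

def negative_physics_guidance(denoiser, x_t, t, cond_emb, neg_phys_emb,
                              scale=7.5):
    """CFG-style update that pushes the denoising direction away from an
    embedded physics-violation description, e.g. 'a ball passes through
    the table'."""
    eps_cond = denoiser(x_t, t, cond_emb)     # prompt + chunk physics text
    eps_neg = denoiser(x_t, t, neg_phys_emb)  # negative physics prompt
    return eps_neg + scale * (eps_cond - eps_neg)
```

The guidance function is the standard classifier-free-guidance update with the negative physics embedding standing in for the unconditional branch; whether the paper uses exactly this formulation is an assumption.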

Field relevance. Establishes a scalable, language-mediated pathway for grounding generative video in physical constraint satisfaction. The chunked, local-conditioning paradigm is particularly relevant for human body motion synthesis, where physical plausibility — ground contact, joint torque feasibility, momentum continuity — is both perceptually salient and systematically underserved by current models.

Source tier. Tier 3 — arXiv preprint

Reference. Pathak, S., Arani, E., Pechenizkiy, M., & Zonooz, B. (2026). PhysVid: Physics aware local conditioning for generative video models. arXiv. https://arxiv.org/abs/2603.26285


Development 3 — Motion Manifold Learning / Generative Synthesis

Heracles: Bridging Precise Tracking and Generative Synthesis for General Humanoid Control

Summary. Tao et al. (16 authors) propose Heracles, a state-conditioned diffusion middleware that reframes the longstanding dichotomy in motion generation between strict reference tracking and open-ended generative synthesis. The core insight is that a diffusion model conditioned on real-time robot state information can implicitly interpolate between two operating regimes: when the current state closely aligns with the reference motion, the model approximates an identity map, preserving tracking precision; when significant deviations arise (due to perturbation, physical contact, or unexpected state), the model transitions into a generative synthesizer that produces natural, anthropomorphic recovery trajectories. This adaptive behavior emerges without explicit mode-switching, arising instead from the structure of the learned diffusion prior conditioned on state proximity. The paper positions humanoid control as an open-ended generative problem rather than a rigid constraint satisfaction task, with implications for generalization across unseen motions and environments.
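
A hedged sketch of what such a state-conditioned denoising interface might look like is below. All names, shapes, and the mean-pooled reference summary are assumptions for illustration; notably, the paper describes the tracking-to-synthesis blending as emergent from the learned prior, so nothing here hand-codes a mode switch.

```python
# Illustrative only: a state-conditioned denoiser over a short window of
# future joint targets. Names and shapes are assumptions, not the
# paper's architecture; the regime blending is emergent, not coded here.
import torch
import torch.nn as nn

class StateConditionedDenoiser(nn.Module):
    """Denoises an action window conditioned on the robot's current
    state and the reference motion for the same horizon."""
    def __init__(self, d_state=64, d_ref=64, d_model=256):
        super().__init__()
        self.cond_proj = nn.Linear(d_state + d_ref, d_model)
        self.net = nn.Sequential(
            nn.Linear(d_ref + d_model + 1, d_model), nn.SiLU(),
            nn.Linear(d_model, d_ref),
        )

    def forward(self, x_t, t, state, ref):
        # x_t: (B, H, d_ref) noisy action window; t: (B,) float timestep;
        # state: (B, d_state); ref: (B, H, d_ref) reference window.
        cond = self.cond_proj(torch.cat([state, ref.mean(dim=1)], dim=-1))
        cond = cond.unsqueeze(1).expand(-1, x_t.size(1), -1)
        t_feat = t.view(-1, 1, 1).expand(-1, x_t.size(1), -1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))
```

If the prior behaves as the paper describes, sampling with a state close to the reference should reproduce the reference almost exactly, while a large state-reference gap should yield manifold-consistent recovery trajectories instead.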

Field relevance. Directly advances motion manifold learning by demonstrating that a diffusion prior over the motion manifold can serve as a continuous, state-reactive medium — not just an offline generator. The identity-to-synthesis gradient is a conceptually important contribution: it operationalizes the idea that a learned motion manifold should support both faithful reproduction and creative deviation, depending on how far the current state lies from known reference trajectories.

Source tier. Tier 3 — arXiv preprint

Reference. Tao, Z., Su, Z., Liu, P., Sun, J., Que, W., Ma, J., Yu, J., Cao, J., Sun, P., Liang, H., Han, G., Zhao, W., Xu, Z., Tang, J., Zhang, Q., & Guo, Y. (2026). Heracles: Bridging precise tracking and generative synthesis for general humanoid control. arXiv. https://arxiv.org/abs/2603.27756