Manifold-Aware Motion Conditioning in Generative AI
A Frontier Report
1. The Keypoint Paradigm and Its Geometric Limits
The dominant conditioning paradigm in human motion generation treats pose as a flat array of joint positions—a 3J-dimensional vector for J joints. The convenience is obvious: keypoints are easily supervised, trivially differentiable, and require no domain knowledge of kinematics. But this representation carries a fundamental misalignment with the true geometry of motion.
Human pose occupies a highly structured, curved manifold within ambient Euclidean space. The space of physically valid rotations for a single joint is the Lie group SO(3); the full body constitutes a product of such groups modulated by a kinematic tree structure—a compact, non-Euclidean manifold with nontrivial topology. When a model is conditioned on a raw keypoint vector, it implicitly assumes that interpolating between two poses via straight Euclidean paths is geometrically meaningful. It is not. Euclidean midpoints of joint positions can pass through physically impossible configurations—limbs penetrating the torso, joints inverting beyond anatomical range—because the straight line between two points in ambient space does not stay on the curved manifold of valid poses.
Zhou et al. (2019) exposed this problem formally in the context of rotation regression: low-dimensional rotation representations such as Euler angles in R^3 or unit quaternions on S^3 are either topologically inadequate or burdened by antipodal ambiguity. Their 6D representation—derived from the first two columns of a rotation matrix, a continuous embedding of SO(3) in Euclidean space—demonstrates that representation geometry is not decorative; it directly determines what a network can and cannot learn to generalize.
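The 6D representation is simple to implement. The sketch below (NumPy, not taken from the cited paper's codebase) encodes a rotation matrix as its first two columns and recovers a valid rotation via Gram-Schmidt orthonormalization:

```python
import numpy as np

def rotation_to_6d(R):
    """Encode a rotation matrix as its first two columns, flattened to 6D."""
    return R[:, :2].T.reshape(6)

def rotation_from_6d(d6):
    """Recover a valid rotation matrix via Gram-Schmidt orthonormalization."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1       # remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)               # third column completes the frame
    return np.stack([b1, b2, b3], axis=1)

# Round trip: a 90-degree rotation about z survives encoding/decoding.
Rz = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
assert np.allclose(rotation_from_6d(rotation_to_6d(Rz)), Rz)
```

Because the decoder orthonormalizes, even a noisy 6D prediction maps to an exactly valid rotation—this is what makes the representation safe as a regression target.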
Beyond single-frame pose, sequential motion imposes additional structure. A motion sequence is a trajectory through pose space, and a conditioning signal derived from that trajectory should encode the intrinsic geometry of the path, not its accidental Euclidean embedding. Holden et al. (2016) showed that a learned motion manifold could support phase-functioned synthesis far better than direct keypoint regression. But even that framework mapped motion into a flat latent space without enforcing manifold structure within that space. The research frontier is to close that gap.
2. Riemannian Approaches to Motion Representation
The natural language for curved motion spaces is Riemannian geometry. A Riemannian manifold equips each tangent space with a smoothly varying inner product, enabling proper notions of distance (geodesic length), parallel transport, and curvature.
For rotation, SO(3) is a compact Riemannian manifold. The exponential map takes tangent vectors—infinitesimal rotations from the Lie algebra so(3)—to finite rotations, and the logarithmic map inverts this locally. Geodesic interpolation on SO(3), equivalent to spherical linear interpolation (SLERP) on unit quaternions, thus provides the intrinsically correct smooth path between two orientations, as opposed to the linear interpolation that keypoint-level models implicitly perform.
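These maps are compact to write down. A minimal NumPy sketch of the exponential map (Rodrigues' formula), the logarithm (valid away from rotation angle π), and geodesic interpolation built from them:

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix (an element of so(3))."""
    return np.array([[0., -w[2], w[1]],
                     [w[2], 0., -w[0]],
                     [-w[1], w[0], 0.]])

def so3_exp(w):
    """Exponential map: axis-angle tangent vector -> rotation (Rodrigues)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Logarithmic map: rotation -> axis-angle vector (away from theta = pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    W = (R - R.T) * (theta / (2 * np.sin(theta)))
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

def geodesic_interp(R0, R1, t):
    """Walk a fraction t along the geodesic from R0 to R1, staying on SO(3)."""
    return R0 @ so3_exp(t * so3_log(R0.T @ R1))
```

Interpolating halfway between the identity and a 90-degree rotation yields exactly the 45-degree rotation about the same axis—the geodesic midpoint, not a blend of matrix entries.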
The full-body pose manifold is more complex. Loper et al. (2015) introduced SMPL, parametrizing pose as axis-angle vectors per joint—a local Lie algebra parametrization that is more principled than raw keypoint coordinates but still requires careful normalization to avoid discontinuities where the rotation angle approaches π. A fully Riemannian treatment would place a metric directly on the product manifold of joint rotations and propagate all signals through that metric.
Recent work on Riemannian generative models has demonstrated that stochastic processes can be defined intrinsically on non-Euclidean spaces. De Bortoli et al. (2022) showed that score-based diffusion—defining a forward noising process and reverse denoising process via Riemannian Brownian motion on a compact manifold—enables both sampling and conditioning to respect the manifold's geometry end-to-end. The score function itself becomes a section of the tangent bundle rather than a gradient in R^n, and the denoising trajectory follows geodesics rather than Euclidean drift.
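One standard way to discretize Riemannian Brownian motion is a geodesic random walk: sample a Gaussian step in the tangent space, then move along the corresponding geodesic. The sketch below, on a single SO(3) factor, is illustrative rather than a reproduction of De Bortoli et al.'s implementation:

```python
import numpy as np

def so3_exp(w):
    """Rodrigues formula: tangent vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = np.array([[0., -w[2], w[1]], [w[2], 0., -w[0]], [-w[1], w[0], 0.]])
    K /= theta
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def forward_noise_so3(R, n_steps=100, dt=0.01, rng=None):
    """Geodesic random walk approximating Brownian motion on SO(3):
    each step samples a tangent vector and moves along its geodesic."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_steps):
        xi = rng.standard_normal(3) * np.sqrt(dt)
        R = R @ so3_exp(xi)  # every iterate remains exactly on SO(3)
    return R

R_T = forward_noise_so3(np.eye(3))
# The noised sample is still a valid rotation: orthogonal, determinant +1.
assert np.allclose(R_T @ R_T.T, np.eye(3), atol=1e-8)
assert np.isclose(np.linalg.det(R_T), 1.0)
```

Contrast this with Euclidean noising of matrix entries, which leaves SO(3) after a single step and forces a projection back onto the manifold.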
A practically critical consequence: conditioning signals computed via geodesic distances are interpolatable without artifact. A signal encoding "the midpoint of the geodesic between two poses" encodes a semantically valid pose in the interior of the manifold, not an averaged keypoint cloud that may fall outside the valid region. Downstream generation guided along geodesic paths cannot, by construction, pass through anatomically invalid configurations.
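A toy demonstration of the claim, using unit quaternions (the double cover of SO(3)) as the manifold: the Euclidean midpoint of two valid orientations leaves the manifold, while the geodesic (SLERP) midpoint stays on it. The specific values below are illustrative:

```python
import numpy as np

def slerp(q0, q1, t):
    """Geodesic (great-circle) interpolation between unit quaternions."""
    omega = np.arccos(np.clip(np.dot(q0, q1), -1.0, 1.0))
    if omega < 1e-12:
        return q0
    return (np.sin((1 - t) * omega) * q0 + np.sin(t * omega) * q1) / np.sin(omega)

q0 = np.array([1., 0., 0., 0.])                                # identity
q1 = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4), 0., 0.])  # 90 deg about x

euclid_mid = (q0 + q1) / 2       # straight-line average in ambient R^4
geo_mid = slerp(q0, q1, 0.5)     # geodesic midpoint on the unit sphere

# Euclidean averaging shrinks off the manifold; SLERP does not.
print(np.linalg.norm(euclid_mid))  # about 0.924 — no longer a valid rotation
print(np.linalg.norm(geo_mid))     # 1.0 — still on the manifold
```

The same failure mode, compounded across every joint of a kinematic tree, is what produces the implausible interpolants described above.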
3. Disentangled Motion Representations: Style and Content
A complementary axis of inadequacy in keypoint conditioning is semantic conflation. A keypoint sequence bundles together at minimum two independent sources of variation: what movement is being performed (locomotion, reach, gesture) and how it is performed (individual idiosyncrasy, energy, temporal dynamics). These modes reside on distinct submanifolds—or in distinct factor dimensions—of the full motion space.
Aberman et al. (2020) demonstrated that motion style and content can be disentangled in a neural representation without paired training data. Their framework separates kinematic content—the joint velocity field abstractly describing a motion type—from stylistic character that inflects it. Once disentangled, content from one subject can be retargeted with the style of another, enabling motion analogy operations that are nonsensical in flat keypoint space.
From a conditioning standpoint, this matters architecturally. A conditioning signal carrying undifferentiated keypoints provides no interface for specifying "perform this action with that quality." A manifold-aware representation, by contrast, exposes separate handles on the content manifold and the style manifold, enabling compositional conditioning: direction vectors in the action subspace crossed independently with direction vectors in the quality subspace. The manifold structure provides the correct topology for such composition—where the joint space (content, style) admits independent traversal without coupling.
This also connects to the concept of motion primitives: compact basis elements of the action content manifold. A learned content manifold reflecting such factorization would expose conditioning interfaces aligned with the degrees of freedom that human motor control actually uses—smooth, low-dimensional, combinatorially expressive.
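As an illustration of such compositional conditioning—with entirely hypothetical hyperspherical factor manifolds and random codes standing in for learned ones—one can traverse the style factor while the content factor stays exactly fixed:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def traverse(code, direction, step):
    """Move along the sphere from `code`: project the direction onto the
    tangent space, take a step, and retract back onto the manifold."""
    tangent = direction - np.dot(direction, code) * code
    return normalize(code + step * tangent)

rng = np.random.default_rng(0)
content = normalize(rng.standard_normal(8))   # hypothetical code: the action, e.g. "walk"
style = normalize(rng.standard_normal(8))     # hypothetical code: the quality, e.g. "energetic"

# Edit style independently; the content factor does not move at all.
style_dir = normalize(rng.standard_normal(8))
new_style = traverse(style, style_dir, 0.3)
condition = np.concatenate([content, new_style])  # the composed conditioning signal
assert np.allclose(condition[:8], content)
```

In flat keypoint space there is no analogue of this operation: there are no separate axes along which "walk" and "energetic" can be varied without disturbing one another.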
4. Motion Manifold Models
4.1 VAE-Based Approaches
Variational autoencoders are the primary framework for learning continuous latent manifolds over motion data. Ling et al. (2020) introduced Motion VAE, trained on motion-capture data, demonstrating that sampling from the learned prior produces physically plausible motion and that latent interpolation produces smooth transitions—provided the latent space is well-regularized. The critical limitation is that the standard Gaussian prior treats the latent space as Euclidean, ignoring the curvature of the true data manifold. In regions of high curvature in pose space, this flat approximation degrades.
Petrovich et al. (2021) extended this approach with ACTOR, a Transformer VAE that conditions motion generation on action labels at the level of latent codes. The conditioning architecture provides semantic handles on the action-content dimension—an early form of manifold-structured conditioning—though the latent space geometry remains Euclidean.
A geometrically principled alternative replaces the Euclidean prior with one intrinsic to the manifold's topology. Falorsi et al. (2018) explored this directly for Lie group VAEs: replacing the standard Gaussian with distributions defined on SO(3) and SE(3) matches the latent space geometry to the data geometry, reducing posterior collapse and enabling meaningful geodesic traversal of the latent space. The posterior becomes a distribution on the manifold, and the KL divergence is computed with respect to the manifold's invariant measure rather than a flat Gaussian.
4.2 Flow-Based Approaches
Normalizing flows learn exact-likelihood densities by transforming a base distribution through invertible maps. Standard flows operate in R^n, inheriting all the geometric limitations of Euclidean motion space.
Henter et al. (2020) introduced MoGlow, an autoregressive normalizing flow over motion sequences conditioned on control signals, achieving stronger motion quality and diversity than prior recurrent models. MoGlow operates directly in joint-angle space—technically flat—but its autoregressive factorization captures temporal transition densities that implicitly encode manifold structure through learned conditional distributions.
The frontier is flows operating intrinsically on Riemannian manifolds. Gemici et al. (2016) established the theoretical foundations, showing how the change-of-variables formula extends to embedded manifolds via the Riemannian volume form. More recent work applies neural ODEs integrated along the manifold—Riemannian continuous normalizing flows—to transform distributions on curved spaces without requiring a flat atlas. Applied per joint to SO(3), such flows can learn the true density of pose transitions without Euclidean approximation.
Tevet et al. (2022) demonstrated that diffusion models—continuous-time flows through noise scales—outperform VAE and autoregressive baselines on human motion generation. Their Motion Diffusion Model conditions generation on text and action labels via classifier-free guidance, operating in Euclidean joint-angle space. The score-based formulation is, however, natural to generalize to manifold-valued diffusion (De Bortoli et al., 2022), and the combination—large-scale conditional diffusion operating on the Riemannian pose manifold—constitutes the most promising near-term architecture for manifold-aware conditioning.
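Classifier-free guidance itself is representation-agnostic and one line of arithmetic: the model's conditional and unconditional score estimates are extrapolated with a guidance weight. A minimal sketch with toy stand-in scores (the arrays below are placeholders for network outputs at one denoising step):

```python
import numpy as np

def cfg_score(score_cond, score_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional score
    toward the conditional one with guidance weight w."""
    return score_uncond + w * (score_cond - score_uncond)

s_c = np.array([1.0, 0.0])   # toy conditional score
s_u = np.array([0.2, 0.0])   # toy unconditional score

assert np.allclose(cfg_score(s_c, s_u, 1.0), s_c)   # w = 1 recovers conditional
assert np.allclose(cfg_score(s_c, s_u, 0.0), s_u)   # w = 0 recovers unconditional
```

On a manifold-valued diffusion, the same combination would be formed in the tangent space at the current point, so the guided update remains a valid tangent vector.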
5. What Manifold-Level Conditioning Looks Like
A manifold-level conditioning signal does not provide a pose array. It provides a point on the motion manifold together with a tangent vector—a position and intrinsic velocity in the manifold's geometry. Concretely, it can take several forms:
Latent manifold coordinate with tangent direction. A point on a learned compact manifold (hyperspherical, toroidal, or Lie-group-structured) augmented by a tangent vector indicating the direction and rate of change through the manifold. This encodes not just where a motion is, but how it is moving through motion space—enabling prediction of trajectory rather than just instantaneous state.
Geodesic segment specification. Rather than two endpoint poses, a conditioning segment defines the intrinsically correct path between them. The generation is constrained to follow that geodesic, guaranteeing physical validity throughout the interpolation and enabling smooth transitions between any two valid states.
Disentangled factor coordinates. Separate coordinates on the action-content submanifold and the style/quality submanifold, enabling independent specification. An operator can hold content fixed while traversing style space, or vice versa—operations that are geometrically well-defined on the product manifold but undefined in flat keypoint space.
A differential conditioning field. A section of the tangent bundle over the manifold—specifying, at each point in motion space, a preferred direction of change. This is analogous to a vector field on a Riemannian manifold: it guides generation not toward a fixed target but according to a consistent directional intent across an entire region of the space.
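Pulling the first of these forms into code: a hypothetical container for a manifold-level conditioning signal—a point given as per-joint rotations plus a tangent vector—with a validity check that a flat keypoint array cannot offer. Names and shapes are illustrative assumptions, not an established interface:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ManifoldCondition:
    """Hypothetical manifold-level conditioning signal: a point on the
    motion manifold plus a tangent vector (direction and rate of change)."""
    point: np.ndarray    # per-joint rotation matrices, shape (J, 3, 3)
    tangent: np.ndarray  # per-joint tangent vectors in so(3), shape (J, 3)

    def validate(self):
        """Check each joint's rotation lies on SO(3): orthogonal, det +1."""
        for R in self.point:
            assert np.allclose(R @ R.T, np.eye(3), atol=1e-6)
            assert np.isclose(np.linalg.det(R), 1.0, atol=1e-6)

cond = ManifoldCondition(point=np.stack([np.eye(3)] * 4),
                         tangent=np.zeros((4, 3)))
cond.validate()  # a keypoint vector has no analogous notion of validity
```

The essential difference from a keypoint array is that membership of the valid set is checkable—and, for a generator, enforceable—by construction.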
6. What This Architecture Would Enable
The transition to manifold-aware conditioning is qualitatively different from incremental improvements to keypoint systems:
Physically consistent interpolation. Between any two conditioning states, generation follows geodesics, eliminating the "melting" or anatomically implausible artifacts that arise when naive latent interpolation passes through out-of-distribution configurations.
Compositional specification. Manifold structure exposes degrees of freedom independently—allowing action intent and movement quality to be specified without interference, aligning the conditioning interface with the natural dimensionality of motion intent.
Equivariance under body symmetries. Riemannian conditioning signals are covariant under the symmetries of the motion manifold—global body rotation, bilateral mirroring, temporal scaling. A model conditioned at the manifold level generalizes across these symmetries without augmentation overhead, because the representation already respects them.
Gradient-valid latent navigation. With a proper Riemannian metric on the conditioning space, gradient-based optimization in conditioning space becomes geometrically valid—enabling motion retrieval, constraint satisfaction, and creative exploration with correct distance semantics. The nearest valid pose to a constraint is now well-defined.
Temporal coherence by construction. Motion as geodesic flow in latent space means temporal consistency is not learned from data as an empirical tendency but enforced structurally. The model cannot generate temporally incoherent sequences because the conditioning signal is itself a smooth trajectory on the manifold.
The infrastructure for this shift is largely assembled: Riemannian generative models, Lie group VAEs, disentanglement frameworks, and manifold-valued diffusion all exist as separate contributions. What has not yet been integrated is a unified conditioning architecture in which signals are first-class citizens of the motion manifold—where the interface between user intent and generative model operates entirely in the intrinsic geometry of motion space, rather than projecting that geometry into an ill-suited Euclidean container.
References
Aberman, K., Weng, Y., Lischinski, D., Cohen-Or, D., & Chen, B. (2020). Unpaired motion style transfer from video to animation. ACM Transactions on Graphics, 39(4), Article 64. https://doi.org/10.1145/3386569.3392469
De Bortoli, V., Mathieu, E., Hutchinson, M., Thornton, J., Teh, Y. W., & Doucet, A. (2022). Riemannian score-based generative modelling. Advances in Neural Information Processing Systems, 35, 2406–2422.
Falorsi, L., de Haan, P., Davidson, T. R., De Cao, N., Weiler, M., Forré, P., & Cohen, T. S. (2018). Explorations in homeomorphic variational auto-encoding. arXiv preprint arXiv:1807.04689.
Gemici, M. C., Rezende, D., & Mohamed, S. (2016). Normalizing flows on Riemannian manifolds. arXiv preprint arXiv:1611.02304.
Henter, G. E., Alexanderson, S., & Beskow, J. (2020). MoGlow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics, 39(6), Article 236. https://doi.org/10.1145/3414685.3417836
Holden, D., Saito, J., & Komura, T. (2016). A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics, 35(4), Article 138. https://doi.org/10.1145/2897824.2925975
Ling, H. Y., Zinno, F., Cheng, G., & van de Panne, M. (2020). Character controllers using motion VAEs. ACM Transactions on Graphics, 39(4), Article 40. https://doi.org/10.1145/3386569.3392422
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics, 34(6), Article 248. https://doi.org/10.1145/2816795.2818013
Petrovich, M., Black, M. J., & Varol, G. (2021). Action-conditioned 3D human motion synthesis with transformer VAE. Proceedings of the IEEE/CVF International Conference on Computer Vision, 10985–10995.
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., & Bermano, A. H. (2022). Human motion diffusion model. arXiv preprint arXiv:2212.04048.
Zhou, Y., Barnes, C., Lu, J., Yang, J., & Li, H. (2019). On the continuity of rotation representations in neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5745–5753.