The Moving Body as Forward Model: Merleau-Ponty's Motor Intentionality and Predictive Architectures in Generative AI
Framing the Intersection
There is a moment in the phenomenology of movement that Maurice Merleau-Ponty describes with unusual precision: the moment when a practised action — reaching for a glass, landing from a jump, beginning a turn — is initiated by the body without passing through conscious deliberation. The hand moves toward the glass before the will has formulated the command. The dancer lands before the calculation is complete. This is what Merleau-Ponty calls motor intentionality — intentionality that belongs to the body as a skilled agent, not to the consciousness that oversees it.
Motor intentionality is not a minor footnote in Merleau-Ponty's phenomenology. It is the central argument of Phenomenology of Perception (1945): that the body is not an object in the world commanded by a separate mind, but an intentional arc that reaches toward possibilities and closes on them directly, below the level of representation. The body knows where the glass is not because it has computed a coordinate; it knows it because reaching toward glasses is already part of its habitual body schema.
What makes this argument newly urgent is that the most technically significant development in AI motion generation over the past year is the shift toward predictive architectures — systems designed not merely to reconstruct what happened but to predict what will happen next, generating movement sequences that respect the causal dynamics of the body rather than simply compressing the appearance of recorded motion. This is, in formal terms, the construction of what the neuroscience literature calls a forward model: a system that predicts the sensory consequences of an action before the action occurs.
Merleau-Ponty's motor intentionality is, functionally, a forward model. This essay develops that claim precisely and traces its implications for how we should design, evaluate, and understand AI motion generation systems.
Motor Intentionality: The Argument in Detail
Phenomenology of Perception builds toward its account of motor intentionality through a critique of two prior theories of action: the empiricist account (movement is a conditioned reflex to sensory stimulus) and the intellectualist account (movement is the execution of a mental representation). Merleau-Ponty argues that both accounts presuppose a separation between the perceiving-thinking subject and the moving body that does not exist in skilled embodied action.
His central case is the practised action. When a typist types, they do not consciously direct each finger to each key; the finger-to-key relationships are absorbed into the body schema, and the typist thinks in terms of words, not keystrokes. When a musician plays, the instrument becomes an extension of the body schema — the spatial reach of the instrumentalist literally extends to the bell of the horn. When a dancer executes a memorised phrase, the phrase is not retrieved from memory and executed step by step; it flows from the body as an expressive arc toward a movement possibility that the body already knows how to reach.
This knowledge — the body's knowledge of how to reach the movement possibility — is motor intentionality. It is intentional in the philosophical sense: it is directed toward something, it has content, it reaches toward an object (the phrase, the glass, the key). But it is not cognitive intentionality in the sense that requires a mental representation of the object; the body reaches directly, through the pre-reflective habituality of the schema.
The formal structure of motor intentionality can be stated precisely:
- The body carries a model of its own action capabilities (what Merleau-Ponty calls the body schema)
- That model projects possibilities forward — it reaches toward what it can do
- The action closes on the possibility directly, without passing through explicit representation
- The model updates through experience: each successful and failed action refines the schema
This is the structure of a forward model. The body schema is the model; the projection of possibilities is the forward prediction; the action is the output; the mismatch between the predicted and the actual sensory consequence is the error signal that updates the schema.
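The four-part structure can be sketched in a few lines of code. What follows is a minimal illustration, not a model of any real motor system: the class name `ForwardModel`, the linear form of the prediction, the `true_body` dynamics, and the learning rate are all assumptions introduced here for the example. The point is only to show prediction, action, comparison, and update as one loop.

```python
from dataclasses import dataclass


@dataclass
class ForwardModel:
    """Hypothetical linear forward model standing in for the body schema."""
    w_state: float = 0.0        # learned weight on the current body state
    w_command: float = 0.0      # learned weight on the motor command
    learning_rate: float = 0.1  # assumed step size

    def predict(self, state: float, command: float) -> float:
        # Forward prediction: the expected sensory consequence,
        # available before the action has produced any feedback.
        return self.w_state * state + self.w_command * command

    def update(self, state: float, command: float, observed: float) -> float:
        # Error signal: observed consequence minus predicted consequence.
        error = observed - self.predict(state, command)
        # Delta-rule step, i.e. gradient descent on the squared error.
        self.w_state += self.learning_rate * error * state
        self.w_command += self.learning_rate * error * command
        return error


def true_body(state: float, command: float) -> float:
    # Assumed "real" dynamics that the schema has to learn from experience.
    return 0.8 * state + 0.5 * command


def train(model: ForwardModel, trials) -> list:
    # Each trial is one action: predict, act, compare, refine.
    errors = []
    for state, command in trials:
        observed = true_body(state, command)
        errors.append(abs(model.update(state, command, observed)))
    return errors


trials = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, -0.5)] * 50
model = ForwardModel()
errors = train(model, trials)
print(round(model.w_state, 2), round(model.w_command, 2))
```

After repeated trials the learned weights approach the assumed dynamics and the prediction error shrinks, which is the sense in which each successful and failed action refines the schema.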
Predictive Architectures in AI Motion Generation
The architectural shift in AI motion generation toward predictive objectives is documented in recent technical literature. The Video Generation with Predictive Latents paper (Zhao et al., 2026; arXiv:2605.02134) introduces a training objective in which the generative model must simultaneously reconstruct observed frames and predict future frames that have not been seen. The core claim is that this predictive objective forces the model's latent representations to encode causal structure — the rules governing how the current state evolves — rather than the appearance of individual frames.
This is a departure from the dominant paradigm. Standard video diffusion models are trained on reconstruction: given a noisy version of a frame, denoise it. The model learns to compress and recover appearance. It does not learn, in the process, what will happen next. The predictive objective specifically targets that absence.
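The contrast between the two objectives can be made concrete with a toy example. This is a sketch under stated assumptions, not the training objective of Zhao et al.: the function names, the mean-squared-error terms, and the `prediction_weight` parameter are hypothetical, and the "frames" are reduced to one-dimensional positions.

```python
def mse(predicted, target):
    # Mean squared error between two equal-length sequences.
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def joint_loss(reconstructed, observed, predicted_future, actual_future,
               prediction_weight=1.0):
    # Reconstruction term: how well the seen frames are recovered.
    reconstruction_term = mse(reconstructed, observed)
    # Prediction term: how well the unseen future frames are anticipated.
    prediction_term = mse(predicted_future, actual_future)
    return reconstruction_term + prediction_weight * prediction_term


# A body moving at constant velocity: three seen frames, two unseen ones.
observed = [0.0, 1.0, 2.0]
actual_future = [3.0, 4.0]

# Model A has only learned appearance: it recovers the seen frames
# perfectly and then repeats the last one.
loss_appearance = joint_loss(observed, observed, [2.0, 2.0], actual_future)

# Model B has learned the dynamics: it recovers the seen frames
# and continues the motion.
loss_dynamics = joint_loss(observed, observed, [3.0, 4.0], actual_future)

print(loss_appearance, loss_dynamics)  # prints 2.5 0.0
```

Both models are indistinguishable under a purely reconstructive loss, since each recovers the observed frames exactly. Only the prediction term separates them.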
The parallel to motor intentionality is structural. The reconstructive model corresponds to the empiricist account Merleau-Ponty critiques: it learns the appearance of movement as a conditioned association between inputs and outputs. The predictive model corresponds, at least structurally, to the body schema: it learns the causal dynamics — the rules — from which the next movement state can be projected.
The qualification "at least structurally" matters. The formal homology is real; the claim that the AI predictive model is doing the same thing as motor intentionality would be incorrect, and the distinction between them is precisely what illuminates the limits of current AI architecture.
Where the Homology Holds and Where It Breaks
The homology holds in three respects:
First, both systems generate movement from internal models rather than from stimulus-response chains. Motor intentionality is not triggered by a sensory input; it is initiated by the body's own projection of a motor possibility. The predictive AI model does not retrieve and replay a stored sequence; it generates from a learned forward model applied to the current state.
Second, both systems improve by reducing prediction error. Motor learning, in Merleau-Ponty's account, is the refinement of the body schema through repeated exposure to sensory consequences of action. The predictive AI model reduces prediction error through gradient descent on a loss function that penalises the divergence between predicted and observed future states.
Third, both systems can generalise to novel situations within the domain of their learned model. A practised mover can perform a familiar movement in a new environment without conscious recalculation; the body schema extends to the new context. A well-trained predictive model can generate movement continuations for sequences it has not seen, by applying its causal model to the new input state.
The homology breaks in three respects, and the breaks are philosophically significant:
First, the content of the forward model differs categorically. Merleau-Ponty's body schema is a model of proprioceptive and kinaesthetic experience — it predicts how the body will feel as a result of the intended movement: the weight shift, the joint position, the muscular tension, the spatial reach. The AI model trained on video data predicts how the body will appear as a result of the movement: the pixel configuration, the skeletal landmark positions, the visual texture. These are not the same target.
This is not a minor difference. In somatic practice, the felt quality of movement — the quality that motor intentionality projects forward — is the primary material of the work. The visual appearance is a secondary consequence, often invisible from the outside, accessible only from within. An AI system that predicts appearance will converge on visually plausible motion but systematically miss the felt qualities that are the practitioner's actual domain.
Second, the body schema is built through embodied experience: the practitioner's own history of movement, sensation, correction, and accumulation. The AI system's forward model is built through gradient descent on recorded data from other bodies. The practitioner's schema knows what it feels like to land a jeté because the practitioner has landed jetés; the AI model has processed videos of others landing jetés. Both models are causal, but their causality is grounded differently.
Third, motor intentionality is evaluative as well as predictive. The body schema does not merely project possible movements; it projects better and worse possibilities, more and less appropriate ones for the current situation and intention. Merleau-Ponty describes this as the body's pre-reflective evaluation of affordances — the possible actions that the environment offers to this body at this moment. This evaluative dimension is absent from AI forward models, which have objective functions (prediction error) but no evaluative sensitivity to the significance of different possible movements for this mover in this moment.
Implications for the Design of Somatic AI Systems
The analysis above generates three specific design requirements for an AI motion system that would be genuinely commensurable with motor intentionality rather than merely formally homologous with it.
Requirement 1: Proprioceptive, not visual, prediction targets. A system meant both to take somatic practice as input and to feed back into it should be trained to predict proprioceptive states, the expected felt quality of the body in movement, not just visual states. This requires training data that captures measurable proxies for proprioception (IMU, EMG, joint torque) alongside video, and evaluation metrics that assess the fidelity of felt-quality prediction, not just visual plausibility. No current motion generation system meets this requirement.
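A small sketch shows why a visual metric alone cannot certify this kind of fidelity. Everything named here is hypothetical: the `MotionFrame` record, its channel layout, and the sample values are assumptions for illustration, and EMG and torque are treated only as measurable proxies for felt quality, not as felt quality itself.

```python
from dataclasses import dataclass


@dataclass
class MotionFrame:
    keypoints: list  # visual target: flattened joint positions
    emg: list        # proxy channel: normalised muscle activation
    torque: list     # proxy channel: net joint torque


def mean_abs_error(a, b):
    # Average absolute difference between two equal-length channels.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def evaluate(predicted: MotionFrame, recorded: MotionFrame) -> dict:
    # Report each channel separately so that a good visual score
    # cannot mask a poor score on the proprioceptive proxies.
    return {
        "visual": mean_abs_error(predicted.keypoints, recorded.keypoints),
        "emg": mean_abs_error(predicted.emg, recorded.emg),
        "torque": mean_abs_error(predicted.torque, recorded.torque),
    }


# Recorded: a held posture with strong co-contraction of opposing muscles.
recorded = MotionFrame(keypoints=[0.0, 1.0, 0.5], emg=[0.8, 0.7], torque=[12.0])
# Predicted: the same posture and net torque, but with relaxed muscles.
predicted = MotionFrame(keypoints=[0.0, 1.0, 0.5], emg=[0.2, 0.1], torque=[12.0])

report = evaluate(predicted, recorded)
print(report)
```

The predicted frame is visually perfect and mechanically equivalent at the level of net torque, yet it differs sharply in muscular effort. A per-channel report of this kind is one simple way to keep that difference visible in evaluation.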
Requirement 2: Identity-conditioned forward models. Motor intentionality is always the motor intentionality of this body with its history of movement. A body schema built from another person's movement history is not my body schema; it is an approximation that may fail at precisely the moments that matter most — the idiosyncratic movement solutions, the habitual preferences, the compensatory patterns that are expressions of this body's particular history. This argues for body-specific training as a core architectural requirement, not a refinement.
Requirement 3: Anticipatory conditioning. Motor intentionality precedes visible movement; it is active in the projection phase before the action closes on the possibility. A system that is conditioned on visible movement is always, by construction, reactive — it responds to what has already happened. A system that is conditioned on pre-movement signals (EMG pre-activation, preparatory postural adjustments, the specific quality of the moment before movement begins) is anticipatory — it responds to what is about to happen. This is the difference between tracking and co-creating.
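The gap between reactive and anticipatory conditioning can be illustrated with a simple onset detector. This is a sketch that assumes a common threshold rule (baseline mean plus `k` standard deviations); the function name `detect_onset`, the value of `k`, and the synthetic EMG and position traces are all hypothetical, and real signals would need filtering and rectification first.

```python
from statistics import mean, stdev


def detect_onset(signal, baseline_len, k=3.0):
    """Return the first index after the baseline window where the signal
    exceeds baseline mean + k * baseline standard deviation, else None."""
    baseline = signal[:baseline_len]
    threshold = mean(baseline) + k * stdev(baseline)
    for i in range(baseline_len, len(signal)):
        if signal[i] > threshold:
            return i
    return None


# Synthetic traces, one value per sample (sampling rate left unspecified).
emg = [0.10, 0.12, 0.09, 0.11, 0.10, 0.11, 0.45, 0.60, 0.80, 0.90, 0.90, 0.90]
position = [0.00, 0.01, 0.00, 0.01, 0.00, 0.00, 0.01, 0.00, 0.01, 0.30, 0.60, 0.90]

emg_onset = detect_onset(emg, baseline_len=5)            # muscle pre-activation
movement_onset = detect_onset(position, baseline_len=5)  # visible displacement
lead = movement_onset - emg_onset

print(emg_onset, movement_onset, lead)  # prints 6 9 3
```

In this synthetic case the muscle signal crosses its threshold three samples before any displacement is detectable. A system conditioned on the first signal has that interval in which to respond; a system conditioned only on the second does not.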
Conclusion
Merleau-Ponty's motor intentionality and the predictive architectures emerging in AI motion generation are structurally homologous: both are forward models that generate movement from internal projections of what will happen rather than from conditioned responses to what has happened. The homology is real and technically productive — it identifies precisely where AI architecture needs to develop to become commensurable with somatic intelligence.
The breaks in the homology are equally productive: they specify exactly what current AI motion systems do not yet do (predict felt quality, not just appearance; build from embodied history, not population statistics; respond to pre-movement intention, not completed action) and therefore what research at the somatic-AI intersection must pursue. The argument from Merleau-Ponty is not that AI cannot reach motor intentionality; it is that reaching it requires a specific set of architectural and empirical choices that have not yet been made.
APA References
Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138. https://doi.org/10.1038/nrn2787
Jia, W., et al. (2026). IAM: Identity-aware human motion and shape joint generation. arXiv:2604.25164. https://arxiv.org/abs/2604.25164
Merleau-Ponty, M. (1962). Phenomenology of perception (C. Smith, Trans.). Routledge. (Original work published 1945)
Narayanan, S., Jiang, Z., Narasimhan, S., & Chandraker, M. (2026). PhyCo: Learning controllable physical priors for generative motion. arXiv:2604.28169. https://arxiv.org/abs/2604.28169
Sheets-Johnstone, M. (2011). The primacy of movement (2nd ed.). John Benjamins. https://doi.org/10.1075/aicr.82
Zhao, Y., et al. (2026). Video generation with predictive latents. arXiv:2605.02134. https://arxiv.org/abs/2605.02134