Reading Between the Frames: What AI Motion Systems Miss in the Transitions

Movement lives in the spaces between moments. That is precisely where most AI systems stop looking.


A standard motion capture session records the position of markers on a human body at 120 frames per second. That sounds extremely precise — faster than the eye can follow, denser than film, finer than most of what humans consciously perceive. And for many purposes, it is more than enough.

But one number reframes all of this: at 120 frames per second, consecutive frames are about 8.3 milliseconds apart. A great deal of what constitutes the felt, expressive quality of human movement happens on timescales comparable to or shorter than that: in the acceleration of a reaching arm over its first ten milliseconds, in the micro-hesitation before a weight transfer, in the tiny deceleration that distinguishes a controlled stop from a collapse.
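
The arithmetic behind that figure is easy to check. The following is a minimal sketch; the function names and the 10 ms event duration are illustrative assumptions, not part of any capture system's API.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def intervals_inside(event_ms: float, fps: float) -> int:
    """Number of whole frame intervals that fit inside an event of the given duration."""
    return int(event_ms // frame_interval_ms(fps))


# Compare a few capture rates against a hypothetical 10 ms movement event.
for fps in (30, 120, 240):
    print(
        f"{fps} fps: {frame_interval_ms(fps):.2f} ms between frames, "
        f"{intervals_inside(10.0, fps)} whole interval(s) inside a 10 ms event"
    )
```

At 120 fps, a 10 ms event spans barely more than one frame interval, so at most two or three samples describe it.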

These are not gaps in the data. They are the texture of movement. And they are what most AI motion generation systems do not model.


What "Between the Frames" Actually Means

When a motion capture system records a person walking, it produces a sequence of skeletal snapshots: here is where each joint was at time 0, here is where it was 8.3ms later, and so on. The AI model trained on this data learns the statistical relationships between these snapshots — what joint configurations tend to follow other joint configurations, at what speeds, across what ranges of movement.

What the model does not learn is the transition between those configurations. Mathematically, the endpoints of the path between any two consecutive skeletal poses are defined by the data, but the quality of traversal is often not preserved in the discrete snapshot representation: whether a joint moved along that path with constant velocity or with a sudden burst followed by deceleration, whether the movement settled into its endpoint or simply stopped.
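
To make the distinction concrete, here is a small sketch of two traversals that share the same start and end position but differ in how they get there. The two profiles (a straight line and a cubic ease-out) are illustrative choices, not measurements of real movement.

```python
def constant_velocity(s: float) -> float:
    """Normalized position for a constant-speed traversal, s in [0, 1]."""
    return s


def burst_then_settle(s: float) -> float:
    """Cubic ease-out: fast start, soft deceleration into the end position."""
    return 1.0 - (1.0 - s) ** 3


def speeds(profile, n_steps: int = 100) -> list[float]:
    """Finite-difference speed over n_steps equal time intervals."""
    dt = 1.0 / n_steps
    return [(profile((i + 1) * dt) - profile(i * dt)) / dt for i in range(n_steps)]


for name, profile in [("constant", constant_velocity), ("burst", burst_then_settle)]:
    v = speeds(profile)
    print(
        f"{name}: start={profile(0.0):.1f} end={profile(1.0):.1f} "
        f"initial speed={v[0]:.2f} final speed={v[-1]:.2f}"
    )
```

Both profiles report identical start and end positions. Only the speed over time tells them apart, and that is the information a pose-by-pose description tends to lose.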

This matters because in almost every movement practice, the transition quality is the meaning. A gesture that arrives with a soft deceleration communicates something different from the same gesture arriving with a hard stop, even if the start and end positions are identical. A weight transfer that passes through a moment of genuine suspension — a hovering instant before gravity takes over — is different from a weight transfer that merely visits the same skeletal configuration without that suspension.

Somatic practitioners train extensively on exactly these transition qualities. In Body-Mind Centering, practitioners spend months developing sensitivity to cellular breathing: the micro-fluctuations in body tissue between the larger gross motor movements. In Authentic Movement, the moment just before a movement impulse becomes visible motion is considered one of the most important moments in the practice. Butoh, the Japanese performance tradition, draws on the concept of "ma" (the interval, the negative space between actions) as the carrier of meaning.


Why This Gap Exists in AI Systems

There are two reasons why current AI motion systems systematically miss this transition texture.

The first is data. Motion capture recordings are sequences of discrete samples. Anything between samples has to be inferred by interpolation, commonly with cubic splines, which produce smooth curves but cannot recover acceleration and deceleration patterns that were never captured. If you train a generative model on motion capture data, you are training it on sampled and often smoothed trajectories, not on the actual acceleration profiles of lived movement.
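
A toy example shows why interpolation cannot bring back what sampling missed. This sketch assumes a synthetic, deliberately exaggerated speed signal with a 1 ms dip placed midway between two frames, and uses a uniform Catmull-Rom cubic as a stand-in for whatever spline a real pipeline applies. All names and values are illustrative.

```python
import math

FPS = 120
DT = 1.0 / FPS            # seconds between frames
T_DIP = 5.5 * DT          # dip sits midway between frames 5 and 6


def true_speed(t: float) -> float:
    """Synthetic speed: steady 1.0 with a brief dip far narrower than the frame gap."""
    sigma = 0.001  # 1 ms
    return 1.0 - 0.8 * math.exp(-((t - T_DIP) / sigma) ** 2)


def catmull_rom(p0: float, p1: float, p2: float, p3: float, u: float) -> float:
    """Uniform Catmull-Rom cubic between p1 and p2, with u in [0, 1]."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * u
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u ** 2
        + (-p0 + 3 * p1 - 3 * p2 + p3) * u ** 3
    )


# Sample the signal at the frame times only, as a capture system would.
samples = [true_speed(i * DT) for i in range(12)]

# Reconstruct the value halfway between frames 5 and 6 from the samples alone.
reconstructed = catmull_rom(samples[4], samples[5], samples[6], samples[7], 0.5)

print(f"true speed at the dip:          {true_speed(T_DIP):.3f}")
print(f"reconstructed speed at the dip: {reconstructed:.3f}")
```

The samples on either side of the dip look like steady motion, so the reconstruction is steady too. The hesitation is absent from the data itself, and no choice of spline restores it.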

The second is evaluation. AI motion generation systems are evaluated on metrics like Fréchet Inception Distance (applied to motion features), foot skating rate, and perceptual studies asking "does this motion look natural?" These metrics measure the gross shape of movement and its visual plausibility, not the micro-temporal texture that practitioners attend to. A generated motion can score well on every standard metric while completely flattening the expressive transition qualities that a somatic practitioner would immediately notice.
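
The gap between pose-level and transition-level evaluation can be sketched in a few lines. The two functions below, `endpoint_error` and `speed_profile_rmse`, are hypothetical illustrations rather than established metrics, and the one-dimensional trajectories stand in for a single joint coordinate.

```python
import math


def speeds(positions: list[float], dt: float) -> list[float]:
    """Finite-difference speed between consecutive samples."""
    return [(b - a) / dt for a, b in zip(positions, positions[1:])]


def endpoint_error(ref: list[float], gen: list[float]) -> float:
    """Pose-level check: how far apart the start and end positions are."""
    return abs(ref[0] - gen[0]) + abs(ref[-1] - gen[-1])


def speed_profile_rmse(ref: list[float], gen: list[float], dt: float) -> float:
    """Transition-level check: RMS difference between the two speed profiles."""
    vr, vg = speeds(ref, dt), speeds(gen, dt)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vr, vg)) / len(vr))


n = 60                    # half a second at 120 fps
dt = 1.0 / 120
ts = [i / n for i in range(n + 1)]
reference = [1.0 - (1.0 - s) ** 3 for s in ts]   # soft arrival
generated = list(ts)                             # constant speed, hard stop

print(f"endpoint error:     {endpoint_error(reference, generated):.3f}")
print(f"speed profile RMSE: {speed_profile_rmse(reference, generated, dt):.3f}")
```

The generated trajectory matches the reference exactly at its endpoints while differing throughout in how it moves, which is the kind of difference a pose-centered score does not register.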

Some recent work is beginning to address this. The recently posted Elastic Shape VAE approach (arXiv:2605.09231) encodes skeletal sequences in a manifold-aware representation where the geometry of the transition, not just the start and end pose, is part of the model. The multi-scale motion generation work coming out of CVPR 2026 is also explicitly targeting temporal resolution at multiple granularities. But these are early steps.


A Concrete Example: The Suspension

In dance technique across many traditions — ballet, contemporary, African dance forms, Butoh — one of the most cultivated qualities is suspension: the moment in a movement phrase where the body appears to momentarily defy gravity, held in a dynamic equilibrium before the next phase begins.

Suspension is not a static pose. It is a precise manipulation of the relationship between momentum and gravity at a micro-temporal scale: the body continues to move through the suspension point, but the movement slows through a specific deceleration curve and then re-accelerates. The suspension is in the shape of that deceleration.
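
As a rough illustration of that description, the sketch below builds a toy speed profile that slows smoothly, never reaches zero, and then re-accelerates, and checks those three properties. The Gaussian-shaped dip and the `describe` helper are assumptions made for illustration; they are not a model of how any dancer actually produces a suspension.

```python
import math


def suspension_speed(t: float, depth: float = 0.9, center: float = 0.5,
                     width: float = 0.15) -> float:
    """Toy speed over normalized time: smooth slowdown toward `center`, then recovery."""
    return 1.0 - depth * math.exp(-((t - center) / width) ** 2)


def describe(speeds: list[float]) -> tuple[int, float, bool]:
    """Return (index of slowest sample, slowest speed, True if speed only falls then only rises)."""
    i_min = min(range(len(speeds)), key=speeds.__getitem__)
    falling = all(a >= b for a, b in zip(speeds[:i_min], speeds[1:i_min + 1]))
    rising = all(a <= b for a, b in zip(speeds[i_min:], speeds[i_min + 1:]))
    return i_min, speeds[i_min], falling and rising


ts = [i / 100 for i in range(101)]
profile = [suspension_speed(t) for t in ts]
i_min, v_min, dips_once = describe(profile)

print(f"slowest point at t={ts[i_min]:.2f}, speed={v_min:.2f}")
print(f"still moving at the slowest point: {v_min > 0}")
print(f"single smooth slowdown and recovery: {dips_once}")
```

A static hold would drive the speed to zero, and a sequence of evenly interpolated poses would show no dip at all. The suspension sits between those two cases, and its character depends on parameters like `depth` and `width` that a list of poses does not carry.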

Ask a current AI motion generation system to produce a phrase with suspension. It will produce something that looks roughly right from a distance — the right poses in roughly the right sequence. But the suspension will be missing. The system will simply produce a sequence of poses connected by smooth interpolations, without the specific deceleration pattern that makes a suspension feel like suspension rather than just a brief slow moment.

This is not a failure of the AI system on its own terms. It was never trained to produce or evaluate suspensions, because the concept does not exist in its training data or its evaluation metrics. The training data was recorded at 120fps and then smoothed. The evaluation metrics measure visual plausibility and realism to naive observers, not expressive accuracy to trained movers.


What This Means Going Forward

The frame-between-frames problem is solvable in principle. It requires training data with higher temporal resolution and qualitative annotation, evaluation metrics that capture transition quality (not just pose quality), and architectural choices in the generative model that preserve micro-temporal information rather than discarding it.

It also requires input from somatic practitioners: people who have trained their attention to notice exactly the qualities that current datasets and metrics miss. The vocabulary for describing these qualities exists (suspension, impulse, rebound, melt, float: terms drawn from Laban's Effort work and related dance traditions) and has been developed over decades of movement research. What has been missing is the connection between that vocabulary and the technical design of AI motion systems.

That connection is beginning to form. But for now, when you watch an AI-generated movement and something feels subtly off — the motion looks right but doesn't feel right — it is probably the transitions you are sensing. The frames themselves are there. What is missing is what lives between them.


Further Reading

Laban, R., & Lawrence, F. C. (1947). Effort: Economy in body movement. Macdonald & Evans.

Newlove, J., & Dalby, J. (2004). Laban for all. Nick Hern Books.

Rahman, A., Kumar, S., Barnes, L. E., & Srivastava, A. (2026). An elastic shape variational autoencoder for skeleton pose trajectories. arXiv:2605.09231. https://arxiv.org/abs/2605.09231

Sheets-Johnstone, M. (1966). The phenomenology of dance. University of Wisconsin Press.