The Resonance Benchmark: Kinesthetic Empathy as Evaluation Framework for Somatic-AI Motion Generation

Introduction

When we watch a dancer fall, something in our own body registers the fall. This is not metaphor. The felt sense of witnessing movement — its pull on our own musculature, the micro-tensions it occasions in our spine and diaphragm — constitutes what phenomenologists call kinesthetic empathy: a mode of bodily knowing that operates beneath conscious interpretation. Vittorio Gallese's embodied simulation theory provides the most rigorously neuroscientific account of this phenomenon to date, grounding it in the mirror neuron system and proposing that the perception of movement is never purely spectatorial but always, at some level, a functional re-enactment.

The emergence of generative motion models — systems that synthesise human movement from latent representations, text prompts, or learned dynamics — raises a question that the field has not yet adequately asked: what would it mean for a machine-generated motion to feel resonant to a human observer? Current evaluation frameworks in motion AI rely heavily on Fréchet Inception Distance (FID) scores adapted for motion features, biomechanical plausibility metrics, and perceptual studies that assess naturalness on Likert scales (Guo et al., 2022; Tevet et al., 2022). These are not trivial achievements. But they do not touch what Gallese's theory identifies as the operative variable in embodied encounters with movement: the degree to which a perceived motion activates the observer's own motor programmes. This synthesis argues that kinesthetic empathy, as theorised in embodied simulation research, offers a phenomenologically grounded benchmark for motion AI evaluation that complements and, in crucial respects, exceeds what FID and biomechanical accuracy can measure.


The Neuroscientific Ground: Mirror Neurons and Embodied Simulation

The discovery of mirror neurons in macaque premotor cortex, neurons that discharge both during action execution and during observation of the same action performed by another, opened a biological pathway for understanding intersubjectivity that bypassed the inferential theories dominant in philosophy of mind (di Pellegrino et al., 1992; Gallese et al., 1996; Rizzolatti & Craighero, 2004). In humans, neuroimaging has identified analogous mirror systems in the inferior frontal gyrus (including Broca's area) and the superior temporal sulcus, though the human system is considerably more distributed and context-sensitive than its macaque homologue (Iacoboni et al., 1999; Rizzolatti & Craighero, 2004).

Gallese's contribution was to extend this empirical finding into a theoretical architecture capable of explaining not just action mirroring but the full phenomenology of intersubjective resonance. His embodied simulation theory proposes that the brain's motor system functions as a simulation engine: when we perceive another person's movement, facial expression, or emotional state, we implicitly simulate it using the same neural substrates that would be recruited were we performing or experiencing it ourselves (Gallese, 2005; Gallese & Sinigaglia, 2011). Simulation, on this account, is not analogical reasoning applied post-hoc to sensory data; it is constitutive of perception itself. "We do not merely observe the actions of others," Gallese writes; "we functionally re-enact them" (Gallese & Sinigaglia, 2011, p. 517).

This has direct implications for dance and movement perception. Sheets-Johnstone (2011) argues that kinesthesia is the foundational modality of animate life, prior to and constitutive of spatial and temporal experience, a claim that converges with Merleau-Ponty's (1945/2012) insistence that the body is not an object in the world but the very medium through which world is disclosed. When these phenomenological frameworks are read alongside Gallese's neuroscience, kinesthetic empathy emerges not as an aesthetic luxury but as a structural feature of motor cognition: the observer's body is always already implicated in what it witnesses.

The empirical case for kinesthetic resonance in dance spectatorship ranges from plausible to well supported. Calvo-Merino et al. (2005) demonstrated that expert dancers show greater premotor and parietal activation when observing movements they have trained in than movements they have not, suggesting that motor experience modulates the depth of simulation. Reason and Reynolds (2010) documented first-person audience accounts of kinesthetic response during dance performance, mapping a phenomenology that ranges from micro-postural adjustment to full somatic identification. Foster's (2011) theorisation of choreographing empathy links these somatic responses to cultural and technical conditions of spectatorship, a critical note we return to below.


The Motion AI Landscape: What Is Being Generated, and How Is It Evaluated?

Contemporary motion generation systems fall into roughly three categories: physically constrained models that optimise for biomechanical plausibility (Rempe et al., 2021; Yuan et al., 2023), diffusion- and VAE-based generative models conditioned on text or audio (Tevet et al., 2022; Zhang et al., 2022; Petrovich et al., 2022), and autoregressive sequence models that treat motion as language (Jiang et al., 2023). The field has achieved remarkable results in producing motion that reads as physically plausible and stylistically coherent when evaluated against motion-capture corpora. Text-conditioned models can now synthesise recognisable movement vocabulary: a prompt such as "a person walks cautiously across a room" yields output that human raters reliably identify as cautious walking.

The standard evaluation protocol for generative motion models inherits metrics from the image generation literature, principally FID computed over feature embeddings from pretrained motion encoders, alongside R-precision (whether the generated motion retrieves its conditioning text) and diversity scores (Guo et al., 2022). These metrics operationalise two things: distributional similarity to a reference corpus of human motion capture, and semantic coherence with conditioning signals. What they do not operationalise is felt resonance — the degree to which a motion activates an observer's motor system rather than merely satisfying a distributional or semantic criterion.
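To make concrete what these metrics do and do not measure, the following is a minimal sketch of two of them, not the implementation used in any cited paper. It assumes motion and text have already been mapped to feature vectors by some pretrained encoder (not shown), and it simplifies the Fréchet distance by treating feature dimensions as independent (diagonal covariance); the full metric uses complete covariance matrices. The function names are illustrative.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diagonal(real_feats, gen_feats):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    ASSUMPTION: diagonal covariance, so the trace term reduces to the sum of
    squared differences of per-dimension standard deviations.
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_col = [f[d] for f in real_feats]
        gen_col = [f[d] for f in gen_feats]
        mean_gap = fmean(real_col) - fmean(gen_col)
        std_gap = math.sqrt(pvariance(real_col)) - math.sqrt(pvariance(gen_col))
        total += mean_gap ** 2 + std_gap ** 2
    return total


def r_precision(motion_embs, text_embs, top_k=1):
    """Fraction of motions whose paired text (same index) is among the
    top_k nearest text embeddings by Euclidean distance."""
    hits = 0
    for i, motion in enumerate(motion_embs):
        dists = [math.dist(motion, text) for text in text_embs]
        ranked = sorted(range(len(text_embs)), key=lambda j: dists[j])
        if i in ranked[:top_k]:
            hits += 1
    return hits / len(motion_embs)
```

Both functions take only feature vectors as input. Nothing in either computation refers to an observer, which is the gap the paragraph above identifies.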

This gap is not incidental. It reflects a latent assumption embedded in motion AI research: that human movement can be adequately described as a sequence of joint angles and velocities, and that generation fidelity is a matter of reproducing that sequence within acceptable statistical tolerances. The Gallesian framework challenges this assumption directly. Motion, on embodied simulation theory, is not primarily a geometric event; it is a solicitation — it calls forth a motor response in the witness. A generative system that reproduces the joint-angle statistics of human movement without reproducing its motor solicitation quality would, on this account, be generating movement that looks right without feeling resonant.


The Structural Analogy: Can a Generative Model Achieve Motor Resonance?

This is the most speculative terrain of the synthesis and should be labelled as such: what follows is grounded in structural analogy rather than empirical evidence.

Gallese's embodied simulation engine has a functional description: it receives sensory input representing another agent's movement, runs that input through motor-predictive circuits, and generates an internal state that functionally mirrors the observed motion. The key computational properties are: (1) the system has an internal model of movement dynamics, (2) it uses that model to simulate — not just classify — what it perceives, and (3) the simulation is embodied in the sense that it recruits the same circuits that would generate movement, not merely circuits that represent it symbolically.

Generative motion models have a structurally analogous architecture in certain respects. A diffusion model trained on human motion implicitly learns a prior over movement dynamics; during generation, it runs a denoising process through that learned prior. World-model approaches from reinforcement learning (Ha & Schmidhuber, 2018; Hafner et al., 2019) are even more explicitly simulatory: they maintain an internal predictive model of an agent's dynamics and, when applied to embodied control, generate movement by rolling forward through that model. One could argue that these systems, in learning to predict body dynamics from data, have acquired something functionally analogous to motor representation: a latent encoding of how bodies move that is not merely statistical but dynamically generative.
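The distinction between simulating and classifying can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the single-joint state, the hand-written spring-damper update standing in for a learned dynamics model, and all parameter values. It is not drawn from any of the cited systems.

```python
def step(state, dt=0.05, stiffness=4.0, damping=0.3):
    """Advance a hypothetical one-joint state (angle, velocity) by one step.

    ASSUMPTION: a hand-written damped spring stands in for a learned
    dynamics model; a real system would learn this update from data.
    """
    angle, velocity = state
    accel = -stiffness * angle - damping * velocity
    velocity = velocity + accel * dt
    angle = angle + velocity * dt
    return (angle, velocity)


def rollout(state, n_steps):
    """Simulate: roll the model forward to produce a predicted trajectory."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory


def classify(state, threshold=0.1):
    """Classify: label an observed state without predicting what follows."""
    return "moving" if abs(state[1]) > threshold else "still"
```

The classifier returns a label for what is observed now; the rollout produces what the model expects to happen next. Only the second corresponds to property (2) above, and neither addresses property (3).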

The caveat is essential: this structural analogy does not constitute evidence that such systems experience anything, nor that their internal states bear any phenomenological relationship to the felt quality of motor resonance. Gallese's account is grounded in the fact that the human observer's mirror system is continuous with the motor system that actually moves their body — the simulation activates circuits with motor consequence. A generative model's internal dynamics have no such embodied grounding. The analogy is at the level of functional description, not mechanism. We should treat it as a design aspiration and an evaluation prompt, not a claim about machine phenomenology.


A Phenomenologically Grounded Evaluation Framework

What, then, would it mean to evaluate motion AI against a kinesthetic empathy benchmark? Three operationalisation strategies are plausible:

1. Motor activation as a perceptual criterion. Rather than asking raters "how natural does this look?", evaluation studies could recruit methods from experimental aesthetics and dance science — EMG measurement of muscle activation in observers, fMRI paradigms comparing observer responses to AI-generated versus motion-captured movement, or validated psychophysical scales for kinesthetic resonance adapted from Reason and Reynolds (2010). This would require interdisciplinary infrastructure not yet standard in the motion AI field but methodologically tractable.

2. Expert somatic witness as proxy. Trained somatic practitioners — Body-Mind Centering practitioners, Contact Improvisation teachers, Laban Movement Analysis practitioners — demonstrate reliable inter-rater agreement on qualitative dimensions of movement such as effort quality and spatial intent (Laban & Lawrence, 1947/2011; Hackney, 2002). Recruiting such expertise as an evaluation cohort would operationalise kinesthetic resonance through trained somatic sensitivity rather than neural instrumentation, at considerably lower cost.

3. Generative feedback architectures. The most ambitious operationalisation would close the loop entirely: building evaluation into a feedback system in which an AI model trained on somatic witness reports iteratively revises generated motion toward higher resonance ratings. This requires a training signal derived from felt quality rather than geometric accuracy — a significant departure from current motion AI pipelines, but one that the field's increasing interest in human-in-the-loop generation makes tractable.
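In analysis terms, the first two strategies reduce to familiar statistics: a paired comparison of each observer's response to AI-generated versus motion-captured clips, and an agreement coefficient between expert raters. The sketch below shows one possible choice for each, an exact sign-flip permutation test and Cohen's kappa. These are assumptions about how such a study might be analysed, not procedures taken from the cited sources, and the function names and example labels are hypothetical.

```python
from collections import Counter
from itertools import product
from statistics import fmean


def paired_sign_flip_pvalue(ai_scores, mocap_scores):
    """Two-sided exact permutation p-value for paired observer responses.

    Each observer contributes one score per condition (for example, mean
    EMG amplitude). Enumerates all 2**n sign flips, so it is only
    practical for small cohorts (roughly n <= 20).
    """
    diffs = [a - m for a, m in zip(ai_scores, mocap_scores)]
    observed = abs(fmean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped_mean = fmean(s * d for s, d in zip(signs, diffs))
        if abs(flipped_mean) >= observed - 1e-12:
            extreme += 1
    return extreme / total


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters' categorical labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, two raters labelling the same clips as "strong" or "light" in effort would pass their label lists to cohens_kappa, while per-observer physiological measures for the two clip sources would go to paired_sign_flip_pvalue. With more than two raters, a multi-rater coefficient would be needed instead.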

Foster's (2011) caution deserves weight here: kinesthetic empathy is not a culturally neutral faculty. What produces motor resonance is shaped by movement culture, trained attention, and social position. An evaluation framework built on kinesthetic empathy must therefore control for observer movement background and resist the assumption that any single population's somatic responses constitute a universal standard.
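One simple way to honour this caution in analysis is to report resonance ratings per observer background rather than as a single pooled mean. The sketch below assumes each rating is tagged with a self-reported movement background; the category names and the idea of summarising disagreement as a gap between group means are illustrative assumptions, not an established protocol.

```python
from collections import defaultdict
from statistics import fmean


def resonance_by_background(ratings):
    """Group (background, score) pairs and return the mean score per group."""
    groups = defaultdict(list)
    for background, score in ratings:
        groups[background].append(score)
    return {background: fmean(scores) for background, scores in groups.items()}


def max_group_gap(group_means):
    """Largest difference between any two group means.

    A large gap suggests the pooled mean hides disagreement between
    observer populations.
    """
    values = list(group_means.values())
    return max(values) - min(values)


# Hypothetical ratings for a single generated clip.
example_ratings = [("ballet", 4.0), ("ballet", 2.0), ("no training", 1.0)]
example_means = resonance_by_background(example_ratings)
```

A benchmark could then report the per-group means alongside the gap, so that no single population's response is presented as the score for a clip.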


Conclusion

Gallese's embodied simulation theory does not merely describe how humans perceive movement; it specifies the functional target that any somatic-AI system aspiring to felt resonance must approximate. Mirror-system activation during movement observation is well supported by the neuroscientific evidence. The extension of this framework to AI-generated motion as a design and evaluation criterion is speculative but structurally coherent and methodologically tractable. The field's current reliance on distributional and semantic metrics captures what generated movement is but not what it does to an embodied observer. Kinesthetic empathy, taken seriously as an evaluation criterion, shifts the question from fidelity-to-corpus to resonance-in-body, a shift that may prove definitive for the next generation of somatic-AI systems.


References

Calvo-Merino, B., Glaser, D. E., Grèzes, J., Passingham, R. E., & Haggard, P. (2005). Action observation and acquired motor skills: An fMRI study with expert dancers. Cerebral Cortex, 15(8), 1243–1249. https://doi.org/10.1093/cercor/bhi007

di Pellegrino, G., Fadiga, L., Fogassi, L., Gallese, V., & Rizzolatti, G. (1992). Understanding motor events: A neurophysiological study. Experimental Brain Research, 91(1), 176–180. https://doi.org/10.1007/BF00230027

Foster, S. L. (2011). Choreographing empathy: Kinesthesia in performance. Routledge.

Gallese, V. (2005). Embodied simulation: From neurons to phenomenal experience. Phenomenology and the Cognitive Sciences, 4(1), 23–48. https://doi.org/10.1007/s11097-005-4737-z

Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119(2), 593–609. https://doi.org/10.1093/brain/119.2.593

Gallese, V., & Sinigaglia, C. (2011). What is so special about embodied simulation? Trends in Cognitive Sciences, 15(11), 512–519. https://doi.org/10.1016/j.tics.2011.09.003

Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., & Cheng, L. (2022). Generating diverse and natural 3D human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5152–5161). https://doi.org/10.1109/CVPR52688.2022.00509

Ha, D., & Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122. https://doi.org/10.5281/zenodo.1207631

Hackney, P. (2002). Making connections: Total body integration through Bartenieff fundamentals. Routledge.

Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2019). Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.

Iacoboni, M., Woods, R. P., Brass, M., Bekkering, H., Mazziotta, J. C., & Rizzolatti, G. (1999). Cortical mechanisms of human imitation. Science, 286(5449), 2526–2528. https://doi.org/10.1126/science.286.5449.2526

Jiang, B., Chen, X., Liu, W., Yu, J., Yu, G., & Chen, T. (2023). MotionGPT: Human motion as a foreign language. In Advances in Neural Information Processing Systems, 36.

Laban, R., & Lawrence, F. C. (2011). Effort: Economy in body movement (3rd ed.). Macdonald & Evans. (Original work published 1947)

Merleau-Ponty, M. (2012). Phenomenology of perception (D. A. Landes, Trans.). Routledge. (Original work published 1945)

Petrovich, M., Black, M. J., & Varol, G. (2022). TEMOS: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision (pp. 480–497). Springer. https://doi.org/10.1007/978-3-031-20047-2_28

Reason, M., & Reynolds, D. (2010). Kinesthesia, empathy, and related pleasures: An inquiry into audience experiences of watching dance. Dance Research Journal, 42(2), 49–75. https://doi.org/10.1017/S0149767700001030

Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., & Guibas, L. J. (2021). HuMoR: 3D human motion model for robust pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 11468–11479).

Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169–192. https://doi.org/10.1146/annurev.neuro.27.070203.144230

Sheets-Johnstone, M. (2011). The primacy of movement (2nd ed.). John Benjamins.

Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., & Bermano, A. H. (2022). Human motion diffusion model. arXiv preprint arXiv:2209.14916.

Yuan, Y., Song, J., Iqbal, U., Vahdat, A., & Kautz, J. (2023). PhysDiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16010–16021).

Zhang, M., Cai, Z., Pan, L., Hong, F., Guo, X., Yang, L., & Liu, Z. (2022). MotionDiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001.