Innovation Brief: Toward a Somatic Latency Index — Redefining "Real-Time" for Embodied Visual Feedback

Field: Embodied AI / Somatic Computing / Movement-Responsive Systems
Brief Type: Experimental Provocation
Word Count: ~1,100


The Provocation

There is a number haunting every somatic-AI system being built today: 100 milliseconds. Developers cite it as the threshold below which digital response feels "simultaneous" with its trigger. Motion capture pipelines are benchmarked against it. Generative visual systems are optimized to come in under it. It has become, quietly, the latency law of the emerging field.

The problem is that nobody derived it from a body in motion.

The 100ms threshold originates in psychoacoustic research — specifically, the study of echo perception and audio feedback in speech and music performance (Välimäki et al., 2012). It was never proposed as a universal constant of embodied perception, and yet somatic computing has imported it wholesale, as though the nervous system processes a visual ripple in response to a pelvis shift with the same temporal logic it applies to a guitar note meeting its reverb tail.

This brief argues that the 100ms threshold is the wrong question applied to the wrong modality. It proposes a purpose-built experiment to identify what we might call a Somatic Latency Index — a differentiated set of temporal thresholds specific to movement registers, perceptual modes, and the quality of felt continuity between body and generative image.


Why Existing Thresholds Are Wrong

The 100ms figure derives from two distinct bodies of work: Hafter and Buell's (1990) work on auditory echo suppression, and the motor control literature on sensorimotor feedback loops, which identifies roughly 100–150ms as the window for proprioceptive correction during voluntary movement (Wolpert & Kawato, 1998). Neither of these is a theory of visual perception of self-generated movement in an expressive context.

Three specific problems compound when we apply audio thresholds to somatic-visual feedback:

1. Movement registers operate on radically different timescales. Postural shifts — weight transfers, spinal lengthening, the settling of a held shape — unfold over 2–8 seconds. Gestural movement — the arc of an arm, a head turn — operates in the 400ms–1.5s range. Breath operates on a 3–6 second cycle with micro-variations in the 100–300ms range. These are not variations on the same phenomenon. They are phenomenologically distinct registers of bodily self-experience, each with different attentional structures and different tolerances for temporal disjunction.

2. The question of continuity is not the same as the question of synchrony. Audio latency research asks: when does the echo become distinguishable from the source? Somatic feedback research should ask something different: when does the response feel like it belongs to this movement, versus a response to a previous movement? These are different cognitive events. The first is a detection task. The second is a proprioceptive-narrative integration task — closer to the felt sense literature (Gendlin, 1978) than to signal detection theory.

3. Generative visual response is not a passive reflection. A reverb tail restates a sound. A generative visual system interprets a movement — it produces a novel image that must be read as continuous with the act that produced it. The latency that feels continuous is therefore not purely a function of timing; it is also a function of semantic proximity and visual coherence. Two responses at identical latencies may feel radically different in their "belonging-to-the-body" quality depending on what is generated.


The Proposed Experiment

We propose a controlled phenomenological study, provisionally titled Threshold of Belonging, combining quantitative latency manipulation with structured somatic inquiry.

Participants: 24 participants drawn from three populations — trained somatic practitioners (e.g., Body-Mind Centering, Authentic Movement), professional dancers, and movement-naive participants — to examine whether somatic training modulates latency perception.

Setup: Participants move freely in front of a motion-capture and RGB camera array. Their movement drives a generative visual system (diffusion-based, latency-controllable) that projects a real-time visual response onto a screen in their peripheral field — visible but not requiring direct gaze. Peripheral vision is intentional: it minimizes the contribution of voluntary visual attention and foregrounds ambient, proprioceptive-adjacent perception.

Latency Conditions: Seven latency conditions — 0ms (playback with artificial synchrony), 50ms, 100ms, 200ms, 400ms, 800ms, and 1600ms — are presented in randomized order across movement register prompts: postural exploration (minimal, sustained), gestural improvisation (expressive, varied), and breath-following (subtle, cyclic).
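The full design crosses seven latency conditions with three movement registers, yielding 21 trials per participant in a participant-specific random order. A minimal sketch of that counterbalancing logic (the function name `trial_order` and the seeding scheme are illustrative assumptions, not part of the protocol as specified):

```python
import random

LATENCIES_MS = [0, 50, 100, 200, 400, 800, 1600]
REGISTERS = ["postural", "gestural", "breath"]

def trial_order(participant_id: int, base_seed: int = 42) -> list[tuple[str, int]]:
    """Return a randomized order of (register, latency_ms) trials.

    Every participant sees each register x latency pairing exactly once;
    a participant-specific seed makes the order reproducible for analysis.
    """
    rng = random.Random(base_seed + participant_id)
    trials = [(reg, lat) for reg in REGISTERS for lat in LATENCIES_MS]
    rng.shuffle(trials)
    return trials
```

In practice one might block trials by register (so each movement prompt is explored continuously) and randomize latency only within blocks; the flat shuffle above is the simplest fully randomized variant.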

Measurement: Two primary measures are employed. First, a binary phenomenological report: after each condition, participants are asked not "did you notice a delay?" but "did the image feel like yours?" — specifically framing the question in terms of ownership and narrative belonging rather than temporal detection. Second, a five-point felt-continuity scale adapted from Botvinick and Cohen's (1998) rubber hand illusion methodology, reframed for movement rather than static proprioception.

Secondarily, we record gaze behavior (to track whether participants are drawn to look directly at the screen — a sign the response has become cognitively rather than somatically processed) and galvanic skin response as a proxy for somatic disruption.

Analysis: Threshold curves are constructed per movement register and per population. The primary outcome is not a single number but a threshold profile — a differentiated map of latency tolerances by register, training background, and image coherence quality.
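One way to construct a threshold curve from the binary "did the image feel like yours?" reports is to fit a falling logistic per register and read off the 50% "belonging" point. The sketch below uses a brute-force grid fit to keep it dependency-free; the function names and grid ranges are assumptions for illustration, not the study's committed analysis plan:

```python
import math

def belonging_curve(latency_ms: float, threshold: float, slope: float) -> float:
    """Predicted P('felt like mine') as a falling logistic in latency."""
    return 1.0 / (1.0 + math.exp(slope * (latency_ms - threshold)))

def fit_threshold(latencies: list[int], yes_rates: list[float]) -> tuple[int, float]:
    """Least-squares grid fit of the 50% belonging point and slope.

    latencies: the tested latency conditions in ms
    yes_rates: observed proportion of 'felt like mine' reports per condition
    """
    best_thr, best_slope, best_err = 0, 0.0, float("inf")
    for thr in range(0, 1601, 10):            # candidate thresholds, ms
        for s in range(1, 51):                # candidate slopes, 0.001..0.050
            slope = s / 1000.0
            err = sum((belonging_curve(l, thr, slope) - y) ** 2
                      for l, y in zip(latencies, yes_rates))
            if err < best_err:
                best_thr, best_slope, best_err = thr, slope, err
    return best_thr, best_slope
```

Fitting this separately per register and per population yields exactly the kind of threshold profile the analysis calls for: not one number, but a family of curves whose 50% points can be compared.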


What Success Looks Like

Success is not a new universal constant. The field should resist replacing one borrowed number with another.

Success is a differentiated framework that allows practitioners and system designers to ask: what movement register is this system primarily coupled to? If postural, the acceptable latency window may be as wide as 600–800ms. If gestural, it narrows toward 150–200ms. If breath-coupled, the threshold may be less about absolute latency and more about phase coherence within the breath cycle — an entirely different temporal logic.
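For the breath-coupled case, the relevant quantity under this phase-coherence framing would not be the delay itself but where in the breath cycle the response lands. A minimal sketch of that reframing (the helper name `phase_offset` is an illustrative assumption):

```python
def phase_offset(latency_ms: float, breath_period_ms: float) -> float:
    """Express a latency as a fraction of the breath cycle, in [0, 1).

    Under a phase-coherence logic, 800 ms against a 4 s breath lands at
    the same cycle position as 4800 ms: both are 0.2 of a cycle late.
    """
    return (latency_ms % breath_period_ms) / breath_period_ms
```

The point of the example is that two very different absolute latencies can be phase-equivalent, which is why a single millisecond threshold cannot capture the breath register.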

Success is also a set of validated phenomenological instruments — replicable, open-source questionnaires and interview protocols — that allow any somatic-AI research group to measure felt continuity rather than technical synchrony. These do not currently exist as a standardized toolkit.

Finally, success establishes that the question of latency cannot be separated from the question of what is generated. The experiment's secondary analysis — comparing high- versus low-coherence visual outputs at identical latencies — should yield the first empirical data on whether somatic ownership of a generative image is a temporal phenomenon, a semantic phenomenon, or an irreducibly entangled combination of both.


The Invitation

This experiment cannot be done by a single discipline. It requires motion capture engineers, diffusion model practitioners, somatic educators, phenomenological researchers, and cognitive neuroscientists who study the bodily self-model — ideally working in the same room, watching the same body move, disagreeing about what they are seeing.

The provocation, restated: real-time is not a technical specification. It is a claim about consciousness — specifically, about the conditions under which a body recognizes its own action in the world. Building somatic-AI systems without empirically grounding that claim is not agile development. It is a category error dressed as engineering.

The field is invited to treat the latency question as seriously as it treats the generative quality question. The body, it turns out, has opinions.


References

Botvinick, M., & Cohen, J. (1998). Rubber hands 'feel' touch that eyes see. Nature, 391(6669), 756. https://doi.org/10.1038/35784

Gendlin, E. T. (1978). Focusing. Everest House.

Hafter, E. R., & Buell, T. N. (1990). Restarting the adapted binaural system. Journal of the Acoustical Society of America, 88(2), 806–812. https://doi.org/10.1121/1.399730

Välimäki, V., Parker, J. D., Savioja, L., Smith, J. O., & Abel, J. S. (2012). Fifty years of artificial reverberation. IEEE Transactions on Audio, Speech, and Language Processing, 20(5), 1421–1448. https://doi.org/10.1109/TASL.2012.2189567

Wolpert, D. M., & Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11(7–8), 1317–1329. https://doi.org/10.1016/S0893-6080(98)00066-5