The Body as Continuous Signal: A Practitioner Guide to Body-Sensing → Generative Visual Pipelines in TouchDesigner

Why Continuous Beats Discrete

Most interactive media systems treat the body as a remote control: a gesture fires an event, an event triggers content. This architecture borrows its logic from button presses, not from movement. But the body is never still between gestures — it breathes, sways, leaks stored tension into micro-adjustments that precede any nameable action. When we flatten that continuous field into discrete triggers, we discard the kinetic subtext that gives movement its expressive texture.

The alternative — treating the body as an unbroken signal — requires a different mental model. Rather than asking "what did the body do?", we ask "what is the body's current energetic state?" The difference is consequential for generative imagery: a trigger system produces cut responses; a continuous system produces morphology. The visual output can carry qi yun (氣韻) — the resonant vitality that traditional Chinese aesthetics identify as the animating quality distinguishing living form from dead notation (Lin, 2015). This is the design target.

Technically, the pipeline has three stages: (1) skeletal signal extraction via MediaPipe, (2) feature engineering into perceptually meaningful channels, and (3) real-time diffusion synthesis via StreamDiffusion (Kodaira et al., 2023). Each stage must honor the continuous nature of the signal or the downstream benefit collapses.


Stage 1: MediaPipe in TouchDesigner via Python Script DAT

MediaPipe Pose runs efficiently inside TD's Python environment using a Script DAT cooked every frame. The following assumes TD 2023.11340+ and mediapipe installed into TD's Python search path.

# Script DAT named 'pose_table' — force it to cook every frame,
# e.g. by referencing absTime.frame in a custom parameter
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose


def onCook(scriptOp):
    # Initialise the pose model once, persisted in operator storage
    if 'pose' not in scriptOp.storage:
        scriptOp.storage['pose'] = mp_pose.Pose(
            model_complexity=1,
            smooth_landmarks=True,
            min_detection_confidence=0.5,
            min_tracking_confidence=0.5
        )
    pose = scriptOp.storage['pose']

    # Pull the current frame from a Video Device In TOP named 'videoin'.
    # numpyArray() returns float32 RGBA in [0, 1], rows bottom-up;
    # delayed=True returns the previous frame without stalling on GPU readback
    frame = op('videoin').numpyArray(delayed=True)
    frame_rgb = np.ascontiguousarray(
        np.flipud((frame[:, :, :3] * 255.0).astype(np.uint8)))

    results = pose.process(frame_rgb)

    # Write 33 landmarks × {x, y, z, visibility} into this DAT's own table
    scriptOp.clear()
    scriptOp.appendRow(['lm_id', 'x', 'y', 'z', 'vis'])

    if results.pose_landmarks:
        for i, lm in enumerate(results.pose_landmarks.landmark):
            scriptOp.appendRow([i, lm.x, lm.y, lm.z, lm.visibility])
    else:
        for i in range(33):
            scriptOp.appendRow([i, 0.0, 0.0, 0.0, 0.0])

Two MediaPipe notes worth internalising. First, smooth_landmarks=True applies a temporal filter that reduces jitter — this is already a step toward continuity, but it also introduces a slight lag that matters at high expressivity. Tune min_tracking_confidence rather than disabling smoothing. Second, the z-coordinate from BlazePose is estimated depth relative to the hip midpoint, not metric depth; treat it as a relative proxy for limb reach, not absolute position (Google, 2023).
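When MediaPipe's built-in smoothing introduces too much lag, a common substitute is the One Euro filter (Casiez, Roussel, and Vogel's adaptive low-pass design): it smooths heavily at rest and opens up during fast motion. A minimal sketch in plain Python, usable as a wrapper around any single landmark channel; the parameter defaults are starting points, not tuned values:

```python
import math


class OneEuroFilter:
    """Adaptive low-pass: low cutoff at rest (kills jitter),
    high cutoff during fast motion (kills lag)."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.02, d_cutoff=1.0):
        self.freq = freq              # expected sample rate, Hz
        self.min_cutoff = min_cutoff  # baseline cutoff, Hz
        self.beta = beta              # speed coefficient
        self.d_cutoff = d_cutoff      # cutoff for the derivative filter
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor for a one-pole filter at the given cutoff
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Filtered derivative of the signal
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        # Cutoff rises with speed, so fast motion passes through
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```

Instantiate one filter per landmark coordinate and feed it with smooth_landmarks disabled; this trades MediaPipe's fixed-lag smoothing for a speed-adaptive cutoff.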


Stage 2: Feature Engineering — Velocity, Directional Flow, Stored Energy

Raw landmark positions are coordinates. What generative models need are perceptually salient features. Three channels cover most expressive states.

Velocity Field

# Script CHOP: velocity_extract
# Reads the 'pose_table' DAT updated upstream each frame
import numpy as np


def onCook(scriptOp):
    table = op('pose_table')
    # Rows 1..33 hold landmarks (row 0 is the header)
    curr = np.array([[float(table[i + 1, 1].val),
                      float(table[i + 1, 2].val),
                      float(table[i + 1, 3].val)] for i in range(33)])

    prev = scriptOp.storage.get('prev_lm', None)
    if prev is None:
        vel = np.zeros((33, 3))
    else:
        vel = curr - prev   # delta per frame; multiply by FPS downstream for units/sec
    scriptOp.storage['prev_lm'] = curr.copy()

    # Scalar speed per landmark (L2 norm of the velocity vector)
    speeds = np.linalg.norm(vel, axis=1)

    # Write output channels
    scriptOp.clear()
    for i in range(33):
        scriptOp.appendChan(f'speed_{i}')[0] = float(speeds[i])

    # Aggregate: mean body speed as a single conductor signal
    scriptOp.appendChan('mean_speed')[0] = float(np.mean(speeds))

Directional Flow

Directional flow captures where the body is moving in 2D projection, useful for influencing compositional bias in generated frames.

# Directional flow: aggregate velocity vector across upper-body landmarks
# (continuing in the velocity_extract Script CHOP — vel is in scope)
UPPER = [11, 12, 13, 14, 15, 16, 0]  # shoulders, elbows, wrists, nose

upper_vel = vel[UPPER, :2]                # x, y components only
flow_vec = np.mean(upper_vel, axis=0)     # (dx, dy)
flow_mag = float(np.linalg.norm(flow_vec))
flow_ang = float(np.arctan2(flow_vec[1], flow_vec[0]))  # radians

scriptOp.appendChan('flow_magnitude')[0] = flow_mag
scriptOp.appendChan('flow_angle')[0] = flow_ang
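To turn the flow channels into a compositional bias, one option is to rebuild the 2D vector and clamp its length before applying it as an image-space offset (for example, on a Transform TOP's translate parameters). A sketch; the gain and clamp values are assumptions to tune per piece:

```python
import math


def flow_to_bias(flow_mag, flow_ang, gain=4.0, max_offset=0.25):
    """Map directional flow to a clamped 2D compositional offset.

    Returns (tx, ty) in normalised image units. gain and max_offset
    are illustrative defaults, not calibrated values.
    """
    # Rebuild the vector from magnitude and angle, then scale it
    tx = math.cos(flow_ang) * flow_mag * gain
    ty = math.sin(flow_ang) * flow_mag * gain
    # Clamp the offset length so extreme gestures cannot push
    # the composition off-frame
    length = math.hypot(tx, ty)
    if length > max_offset:
        tx *= max_offset / length
        ty *= max_offset / length
    return tx, ty
```

Feeding the result into a slow lag filter before the Transform keeps the compositional drift continuous rather than twitchy.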

Stored Energy (Postural Tension)

Shenyun (身韻), the principle of "body resonance" in Chinese classical movement pedagogy, describes how kinetic potential is accumulated before release (Lin, 2015). We approximate this as deviation from a neutral postural baseline — the body carrying tension that has not yet discharged.

# Stored energy: L2 distance of the current pose from a resting neutral
# (continuing in the velocity_extract Script CHOP — curr is in scope).
# neutral_pose must be calibrated per performer at session start.

neutral = scriptOp.storage.get('neutral_pose', np.zeros((33, 3)))

deviation = curr - neutral
stored_energy = float(np.mean(np.linalg.norm(deviation, axis=1)))

# Exponential smoothing to suppress transient spikes
alpha = 0.15
prev_e = scriptOp.storage.get('prev_energy', 0.0)
smooth_energy = alpha * stored_energy + (1 - alpha) * prev_e
scriptOp.storage['prev_energy'] = smooth_energy

scriptOp.appendChan('stored_energy')[0] = smooth_energy

Calibrate neutral_pose by averaging 60 frames of the performer standing relaxed at session start. These three channels — mean speed, directional flow vector, stored energy — form the continuous conditioning signal for the generative stage.
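The calibration step can be sketched as a small accumulator. In TD this state would live in the Script CHOP's storage; the class and method names below are illustrative:

```python
import numpy as np


class NeutralCalibrator:
    """Average N frames of relaxed standing into a neutral pose."""

    def __init__(self, n_frames=60, n_landmarks=33):
        self.n_frames = n_frames
        self.buffer = []
        self.neutral = np.zeros((n_landmarks, 3))
        self.done = False

    def feed(self, landmarks):
        """landmarks: (33, 3) array for one frame.
        Returns True once calibration is complete."""
        if self.done:
            return True
        self.buffer.append(np.asarray(landmarks, dtype=float))
        if len(self.buffer) >= self.n_frames:
            # Mean over the collected frames becomes the baseline
            self.neutral = np.mean(self.buffer, axis=0)
            self.buffer.clear()
            self.done = True
        return self.done
```

Gate the stored-energy channel on done so the first second of a session does not read as spurious tension.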


Stage 3: StreamDiffusion Connection

StreamDiffusion (Kodaira et al., 2023) achieves real-time latency by restructuring denoising as a pipeline over a stream of frames rather than a per-request pass: denoising steps of consecutive frames are batched together (Stream Batch), so a finished frame emerges every step interval rather than every full denoising pass, and a residual classifier-free guidance (RCFG) variant cuts the redundant negative-conditioning computation of standard CFG. The practical consequence: it can sustain roughly 20 fps at 512×512 on a consumer GPU while accepting per-frame conditioning updates.
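The throughput gain from batching denoising across frames follows from simple arithmetic. A sketch with illustrative numbers (4 denoising steps at an assumed 12 ms per step; neither figure is a measured benchmark):

```python
def sequential_fps(n_steps, step_ms):
    # One frame must finish all denoising steps before the next starts
    return 1000.0 / (n_steps * step_ms)


def pipelined_fps(step_ms):
    # Batched stream: steps of consecutive frames run together, so a
    # finished frame emerges once per step interval
    return 1000.0 / step_ms


# Illustrative, assumed numbers: 4 denoising steps at 12 ms each
print(sequential_fps(4, 12.0))  # ≈ 20.8 fps
print(pipelined_fps(12.0))      # ≈ 83.3 fps
```

Note that per-frame latency is still the full n_steps × step_ms in both cases: batching raises throughput, not responsiveness.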

The connection from TD to StreamDiffusion runs most reliably over a local ZeroMQ socket (or, for image data, a Shared Mem Out TOP or Spout/Syphon). The following pushes the scalar conditioning via ZeroMQ to an external Python process hosting StreamDiffusion; run it once per frame, for example from an Execute DAT's onFrameStart callback:

# TD side: push conditioning to the StreamDiffusion process
import zmq

# Create the socket once, persisted in operator storage
if 'zmq_socket' not in me.storage:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUSH)
    sock.connect("tcp://localhost:5555")
    me.storage['zmq_socket'] = sock

sock = me.storage['zmq_socket']

chop = op('feature_chop')
cond = {
    'mean_speed':    float(chop['mean_speed'][0]),
    'flow_angle':    float(chop['flow_angle'][0]),
    'stored_energy': float(chop['stored_energy'][0]),
    'prompt_weight': float(chop['stored_energy'][0]) * 2.0  # energy amplifies prompt
}

try:
    sock.send_json(cond, zmq.NOBLOCK)
except zmq.Again:
    pass  # receiver is behind; drop this frame's conditioning

On the StreamDiffusion process side, cond['prompt_weight'] modulates the CFG guidance scale dynamically, so high stored energy pushes the output toward a more strongly conditioned aesthetic state. For ControlNet-conditioned pipelines (Zhang et al., 2023), the pose skeleton can be rendered to a TOP and transmitted as an image tensor alongside the scalar features, giving pose-level layout control on top of energetic conditioning.
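On the receiving side, the host process pulls conditioning dictionaries and maps them onto a guidance scale. The sketch below is hypothetical: cond_to_guidance, its base/clamp values, and the update_guidance hook stand in for however your StreamDiffusion wrapper actually accepts a new CFG scale.

```python
def cond_to_guidance(cond, base=1.2, lo=1.0, hi=3.0):
    """Map the incoming prompt_weight onto a clamped CFG-style
    guidance scale. The mapping and bounds are assumptions to
    tune per aesthetic target."""
    g = base + cond.get('prompt_weight', 0.0)
    return max(lo, min(hi, g))


def serve(update_guidance, endpoint="tcp://*:5555"):
    """PULL loop on the StreamDiffusion host process. update_guidance
    is a placeholder callback for your pipeline's own update path."""
    import zmq  # imported here so the mapping above is testable without pyzmq
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    sock.bind(endpoint)
    while True:
        cond = sock.recv_json()  # blocks until TD pushes conditioning
        update_guidance(cond_to_guidance(cond))
```

Keeping the mapping in one pure function makes the energy-to-guidance curve easy to reshape (e.g., exponential instead of linear) without touching the transport loop.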

AnimateDiff (Guo et al., 2023) is worth noting as a complementary architecture: it inserts motion modules into a diffusion UNet to enforce temporal consistency across frames. For installations where coherence matters more than absolute latency, a hybrid approach — StreamDiffusion for real-time response, AnimateDiff motion modules for smoothing — is viable at reduced frame rates.


The Honest Limitation: Temporal Coherence

Any practitioner who builds this pipeline will encounter the same wall: generative diffusion models have no intrinsic memory of the previous frame. Each synthesised image is probabilistically anchored to its conditioning but not to the prior output. The result is high-frequency visual flickering that breaks the phenomenological contract with the audience — the sense that the image is the body in a continuous transformation.

Current mitigations are partial. StreamDiffusion's residual noise scheduling reduces inter-frame variation but does not eliminate it (Kodaira et al., 2023). AnimateDiff's motion modules provide temporal smoothing within a batch but require batching that introduces latency. Image-space temporal smoothing (e.g., exponential blending of successive frames in a Feedback TOP) reduces flicker but softens edges in ways that can read as video compression artifact rather than aesthetic choice.
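For reference, the Feedback TOP mitigation amounts to an exponential moving average in image space. A numpy sketch of the blend (the mix value is illustrative):

```python
import numpy as np


def feedback_blend(frames, mix=0.6):
    """Exponential blend of successive frames, as a Feedback TOP
    composited at opacity `mix` would do:
    out = mix * prev + (1 - mix) * new."""
    out = []
    prev = frames[0].astype(float)
    out.append(prev)
    for f in frames[1:]:
        prev = mix * prev + (1.0 - mix) * f.astype(float)
        out.append(prev)
    return out
```

Raising mix suppresses flicker further but leaves longer ghost trails behind fast motion — the edge-softening tradeoff in question.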

The deeper issue is that the body moves at 60 Hz and diffusion models reason at frame rates an order of magnitude slower. Until architecture improvements close that gap — or until the artistic strategy embraces flicker as texture rather than treating it as failure — temporal coherence remains the open problem in real-time somatic-generative pipelines.


References

Google. (2023). MediaPipe pose landmark detection. Google Developers. https://developers.google.com/mediapipe/solutions/vision/pose_landmarker

Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., & Dai, B. (2023). AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv. https://arxiv.org/abs/2307.04725

Kodaira, A., Xu, C., Hazama, T., Yoshimoto, T., Ohno, K., Mitsuhori, S., Sugano, S., Cho, H., Liu, Z., & Keutzer, K. (2023). StreamDiffusion: A pipeline-level solution for real-time interactive generation. arXiv. https://arxiv.org/abs/2312.12491

Lin, Y. (2015). The aesthetics of Chinese dance: Body, space, and temporality. Temple University Press.

Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 3836–3847). https://doi.org/10.1109/ICCV51070.2023.00355