User Guide · Evaluation · Behavior

Behavior

The Behavior axis answers “did the policy execute its plan smoothly?” via three trajectory-level metrics — jerk RMS, velocity variance, and path length — that surface jerky, hesitant, or chunk-of-N-artefact policies that happen to succeed.

Why behaviour?

Two policies can produce identical success rates while moving in completely different ways. One executes a clean, smooth motion at constant pace; the other oscillates wildly, snaps between waypoints, or pauses for milliseconds between every action chunk. From a deployment perspective those policies are not interchangeable — one is safe near a human; the other isn't.

The Behavior axis turns motion quality into three scalars that you can read at a glance and that catch a category of failure modes the other two axes are blind to.

Jerk RMS

The root-mean-square of the third time-derivative of the end-effector position, integrated over the rollout. Jerk is the canonical objective signal that motion is "smooth" — humans grade motions on jerk; safety standards constrain it; jerky control aliases are a well-known artefact of chunked autoregressive policies.

jerk_rms = sqrt(mean over t of ||d³p/dt³||²)

Reported in m/s³. Lower is smoother. A skilled human teleop demonstration sits around 0.05; a healthy diffusion policy is typically 0.05–0.15; chunked autoregressive policies often hit 0.5+ at chunk boundaries.

Velocity variance

The variance of the magnitude of the end-effector velocity across the rollout, normalised to the mean velocity. Captures hesitation and stop-start behaviour that jerk RMS can miss when motion is locally smooth but globally inconsistent.

vel_var = Var(||dp/dt||) / max(Mean(||dp/dt||), ε)

Dimensionless. Lower is more uniform; values above ~0.5 typically indicate the policy is stopping and restarting.

Path-length ratio

The actual end-effector path length divided by the straight-line distance between start and goal. The lower bound is 1.0 (perfectly direct); large values indicate detours, retracing, or oscillation.

path_length_ratio = ∫ ||dp/dt|| dt / ||p_end − p_start||

Dimensionless. Pure motion-planning solvers hit 1.0–1.2 routinely; learned policies on well-trained tasks land at 1.2–1.5; values past 2.0 signal pathological motion or repeated retries.

Reading the B axis

{
  "behavior": {
    "jerk_rms":          0.082,
    "vel_var":           0.011,
    "path_length_ratio": 1.18
  }
}

The three numbers are deliberately not aggregated into one — they probe different phenomena. Read them together with the U-axis to see whether smoothness is being paid for with failure or vice versa.

Caveats

Frame rate matters. Jerk RMS is computed by finite differences. The eval runs at 25 Hz by default; report the rate alongside any cross-paper comparison.
Success-conditioned reporting. Behaviour metrics are aggregated over successful episodes only by default — averaging over collapse-to-zero rollouts is misleading. Toggle via --smoothness-all-episodes.
End effector vs. joint. Default is end-effector position in world frame. For comparing chunked-action policies fairly, also inspect joint-space jerk in results.json:behavior.joint_jerk_rms when reported.