Behavior

The Behavior axis answers “did the policy execute its plan smoothly?” with three trajectory-level metrics: jerk RMS, velocity variance, and path-length ratio. Together they surface policies that succeed while moving jerkily, hesitantly, or with visible action-chunk boundary artefacts.

Why behaviour?

Two policies can produce identical success rates while moving in completely different ways. One executes a clean, smooth motion at constant pace; the other oscillates wildly, snaps between waypoints, or pauses for milliseconds between every action chunk. From a deployment perspective those policies are not interchangeable — one is safe near a human; the other isn't.

The Behavior axis turns motion quality into three scalars that you can read at a glance and that catch a category of failure modes the other two axes are blind to.

Jerk RMS

The root-mean-square of the third time-derivative of the end-effector position, averaged over the rollout. Jerk is the canonical objective signal that motion is "smooth": humans grade motions on jerk, safety standards constrain it, and jerk spikes at chunk boundaries are a well-known artefact of chunked autoregressive policies.

jerk_rms = sqrt(mean over t of ||d³p/dt³||²)

Reported in m/s³. Lower is smoother. A skilled human teleop demonstration sits around 0.05; a healthy diffusion policy is typically 0.05–0.15; chunked autoregressive policies often hit 0.5+ at chunk boundaries.
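As a rough illustration of the formula above, the sketch below estimates jerk RMS from uniformly sampled positions. The function name `jerk_rms`, the tuple-based input, and the third-order forward difference are assumptions made for this example; the evaluator's actual differencing scheme is not specified here.

```python
import math

def jerk_rms(positions, dt):
    """RMS jerk (m/s^3) of positions sampled every dt seconds.

    positions: sequence of (x, y, z) tuples in metres.
    Hypothetical helper: the forward-difference scheme is an assumption.
    """
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to form a third difference")
    squared_norms = []
    for p0, p1, p2, p3 in zip(positions, positions[1:], positions[2:], positions[3:]):
        # Third forward difference per axis approximates d^3p/dt^3.
        jerk = [(d - 3 * c + 3 * b - a) / dt ** 3
                for a, b, c, d in zip(p0, p1, p2, p3)]
        squared_norms.append(sum(j * j for j in jerk))
    # Square root of the mean squared jerk magnitude.
    return math.sqrt(sum(squared_norms) / len(squared_norms))

# Usage: x(t) = t^3 has a constant jerk of 6 m/s^3; sample it at 25 Hz.
dt = 1 / 25
cubic = [((i * dt) ** 3, 0.0, 0.0) for i in range(50)]
print(round(jerk_rms(cubic, dt), 6))  # → 6.0
```

A cubic is a convenient sanity check because its third difference is exact, so any deviation from 6 would point at a bug rather than discretisation error.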

Velocity variance

The variance of the end-effector speed (the magnitude of its velocity) across the rollout, normalised by the mean speed. Captures hesitation and stop-start behaviour that jerk RMS can miss when motion is locally smooth but globally inconsistent.

vel_var = Var(||dp/dt||) / max(Mean(||dp/dt||), ε)

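As written, the ratio carries units of m/s (a variance in m²/s² divided by a mean in m/s), so values are only comparable when positions are expressed in metres. Lower is more uniform; values above ~0.5 typically indicate the policy is stopping and restarting.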
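A minimal sketch of the ratio above, assuming uniformly sampled positions. The function name `vel_var`, the use of population variance, and the default `eps` value are assumptions for illustration, not the evaluator's confirmed choices.

```python
import math
from statistics import fmean, pvariance

def vel_var(positions, dt, eps=1e-8):
    """Variance of speed divided by mean speed, guarded by eps.

    positions: sequence of (x, y, z) tuples in metres, sampled every dt seconds.
    Hypothetical helper: population variance and eps=1e-8 are assumptions.
    """
    if len(positions) < 2:
        raise ValueError("need at least 2 samples to compute a speed")
    # Speed between consecutive samples via first-order finite differences.
    speeds = [math.dist(a, b) / dt for a, b in zip(positions, positions[1:])]
    return pvariance(speeds) / max(fmean(speeds), eps)

# Usage: a stop-start motion that alternates between 1 m/s and standing still.
stop_start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0),
              (2.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(vel_var(stop_start, dt=1.0))  # → 0.5
```

The stop-start example has speeds 1, 0, 1, 0, giving a variance of 0.25 over a mean of 0.5, which lands exactly on the ~0.5 level the text associates with stopping and restarting.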

Path-length ratio

The actual end-effector path length divided by the straight-line distance between the start position and the final position of the rollout. The lower bound is 1.0 (perfectly direct); large values indicate detours, retracing, or oscillation.

path_length_ratio = ∫ ||dp/dt|| dt / ||p_end − p_start||

Dimensionless. Pure motion-planning solvers hit 1.0–1.2 routinely; learned policies on well-trained tasks land at 1.2–1.5; values past 2.0 signal pathological motion or repeated retries.
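The ratio can be sketched for sampled positions by replacing the integral with a sum of segment lengths. The function name `path_length_ratio` and the handling of a near-zero straight-line distance (returning infinity below a hypothetical `eps`) are assumptions for this example; the formula above does not say how that case is treated.

```python
import math

def path_length_ratio(positions, eps=1e-8):
    """Travelled path length divided by start-to-end straight-line distance.

    positions: sequence of (x, y, z) tuples in metres.
    Hypothetical helper: the eps guard is an assumption, not documented behaviour.
    """
    if len(positions) < 2:
        raise ValueError("need at least 2 samples")
    # Discrete stand-in for the integral of speed over time.
    travelled = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    direct = math.dist(positions[0], positions[-1])
    if direct < eps:
        return math.inf  # start and end coincide, so the ratio is undefined
    return travelled / direct

# Usage: an L-shaped path covers 2 m to achieve sqrt(2) m of displacement.
l_shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(round(path_length_ratio(l_shape), 3))  # → 1.414
```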

Reading the B axis

{
  "behavior": {
    "jerk_rms":          0.082,
    "vel_var":           0.011,
    "path_length_ratio": 1.18
  }
}

The three numbers are deliberately not aggregated into one because they probe different phenomena. Read them together with the U axis to see whether smoothness is being paid for with failure or vice versa.
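One way to read the block programmatically is to compare each value against the rule-of-thumb levels quoted in the metric sections. The sketch below does that; the function name `flag_behavior` and the idea of treating those levels as flags are assumptions for illustration, not pass/fail criteria defined by the evaluator.

```python
import json

# Rule-of-thumb levels quoted in this guide; illustrative flags, not hard limits.
THRESHOLDS = {"jerk_rms": 0.5, "vel_var": 0.5, "path_length_ratio": 2.0}

def flag_behavior(results_json):
    """Return the names of behavior metrics that exceed their rough level."""
    behavior = json.loads(results_json)["behavior"]
    return [name for name, limit in THRESHOLDS.items()
            if name in behavior and behavior[name] > limit]

example = """
{
  "behavior": {
    "jerk_rms":          0.082,
    "vel_var":           0.011,
    "path_length_ratio": 1.18
  }
}
"""
print(flag_behavior(example))  # → []
```

Keeping the output as a list of metric names, rather than a single score, matches the intent that the three numbers are read side by side.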

Caveats

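  • Frame rate matters. Jerk RMS is computed by finite differences, so the sampling interval enters as 1/Δt³ and values measured at different rates are not directly comparable. The eval runs at 25 Hz by default; report the rate alongside any cross-paper comparison.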
  • Success-conditioned reporting. Behaviour metrics are aggregated over successful episodes only by default — averaging over collapse-to-zero rollouts is misleading. Toggle via --smoothness-all-episodes.
  • End effector vs. joint. Default is end-effector position in world frame. For comparing chunked-action policies fairly, also inspect joint-space jerk in results.json:behavior.joint_jerk_rms when reported.
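To illustrate the frame-rate caveat, the sketch below applies a third forward difference to the same constant-velocity motion with a fixed position noise amplitude, sampled at 25 Hz and 50 Hz. The helper names, the 1-D simplification, the differencing scheme, and the alternating ±0.1 mm noise model are all assumptions chosen to make the effect easy to see.

```python
import math

def jerk_rms_1d(xs, dt):
    """RMS of the third forward difference of a 1-D signal (illustrative)."""
    jerks = [(xs[i + 3] - 3 * xs[i + 2] + 3 * xs[i + 1] - xs[i]) / dt ** 3
             for i in range(len(xs) - 3)]
    return math.sqrt(sum(j * j for j in jerks) / len(jerks))

def noisy_line(n, dt, speed=0.1, noise=1e-4):
    """Constant-velocity motion plus alternating +/- noise, in metres."""
    return [speed * i * dt + noise * (-1) ** i for i in range(n)]

# Two seconds of the same motion at two sampling rates.
for hz in (25, 50):
    dt = 1.0 / hz
    print(hz, "Hz:", round(jerk_rms_1d(noisy_line(2 * hz, dt), dt), 2))
```

The underlying motion has zero jerk, yet doubling the sampling rate multiplies the reported value by 2³ = 8 under this noise model, which is why a jerk figure without its sampling rate is hard to interpret.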