Behavior
The Behavior axis answers “did the policy execute its plan smoothly?” via three trajectory-level metrics — jerk RMS, velocity variance, and path length — that surface jerky, hesitant, or chunk-of-N-artefact policies that happen to succeed.
Why behaviour?
Two policies can produce identical success rates while moving in completely different ways. One executes a clean, smooth motion at constant pace; the other oscillates wildly, snaps between waypoints, or pauses for milliseconds between every action chunk. From a deployment perspective those policies are not interchangeable — one is safe near a human; the other isn't.
The Behavior axis turns motion quality into three scalars that you can read at a glance and that catch a category of failure modes the other two axes are blind to.
Jerk RMS
The root-mean-square of the third time-derivative of the end-effector position, integrated over the rollout. Jerk is the canonical objective signal that motion is "smooth" — humans grade motions on jerk; safety standards constrain it; jerky control aliases are a well-known artefact of chunked autoregressive policies.
jerk_rms = sqrt(mean over t of ||d³p/dt³||²)
Reported in m/s³. Lower is smoother. A skilled human teleop demonstration sits around 0.05; a healthy diffusion policy is typically 0.05–0.15; chunked autoregressive policies often hit 0.5+ at chunk boundaries.
Velocity variance
The variance of the magnitude of the end-effector velocity across the rollout, normalised to the mean velocity. Captures hesitation and stop-start behaviour that jerk RMS can miss when motion is locally smooth but globally inconsistent.
vel_var = Var(||dp/dt||) / max(Mean(||dp/dt||), ε)
Dimensionless. Lower is more uniform; values above ~0.5 typically indicate the policy is stopping and restarting.
Path-length ratio
The actual end-effector path length divided by the straight-line distance between start and goal. The lower bound is 1.0 (perfectly direct); large values indicate detours, retracing, or oscillation.
path_length_ratio = ∫ ||dp/dt|| dt / ||p_end − p_start||
Dimensionless. Pure motion-planning solvers hit 1.0–1.2 routinely; learned policies on well-trained tasks land at 1.2–1.5; values past 2.0 signal pathological motion or repeated retries.
Reading the B axis
{
"behavior": {
"jerk_rms": 0.082,
"vel_var": 0.011,
"path_length_ratio": 1.18
}
}
The three numbers are deliberately not aggregated into one — they probe different phenomena. Read them together with the U-axis to see whether smoothness is being paid for with failure or vice versa.
Caveats
- Frame rate matters. Jerk RMS is computed by finite differences. The eval runs at 25 Hz by default; report the rate alongside any cross-paper comparison.
- Success-conditioned reporting. Behaviour metrics are aggregated over successful episodes only by default — averaging over collapse-to-zero rollouts is misleading. Toggle via
--smoothness-all-episodes. - End effector vs. joint. Default is end-effector position in world frame. For comparing chunked-action policies fairly, also inspect joint-space jerk in
results.json:behavior.joint_jerk_rmswhen reported.