Results schema

Every diagnostic eval run emits a single results.json file. This is the field-by-field reference.

Overview

Each eval run produces exactly one results.json. Top-level keys are stable; the nested fields under each diagnostic dimension are versioned by meta.version. Treat unknown keys as forward-compatible additions: ignore them when reading, and never strip them when writing the file back.
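
The round-trip rule above can be sketched as follows. This is a minimal illustration, not part of any shipped API: set_field is a hypothetical helper, and "future_axis" is a made-up key standing in for a field added by a later schema version. The point is that the whole document is parsed and re-serialised, so keys the code does not know about survive the edit.

```python
import json


def set_field(results_text, dimension, field, value):
    """Return results JSON text with results[dimension][field] set to value.

    Hypothetical helper: the full document is loaded and dumped again,
    so unknown keys are carried through untouched.
    """
    results = json.loads(results_text)
    results.setdefault(dimension, {})[field] = value
    return json.dumps(results, indent=2)


# "future_axis" is an invented key that this code knows nothing about.
original = '{"meta": {"version": "0.1.0"}, "future_axis": {"x": 1}}'
updated = json.loads(set_field(original, "meta", "seed", 42))
print(sorted(updated))  # the unknown key is still present
```

Building a fresh dict from only the known fields would silently drop such keys, which is the failure mode the rule is meant to prevent.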

Top-level layout

{
  "understanding": { ... },
  "perception":    { ... },
  "behavior":      { ... },
  "meta":          { ... }
}

understanding

| Field | Type | Description |
| --- | --- | --- |
| per_stage_success | {stage_name: float} | Stage success rate over all eval episodes. Range [0, 1]. |
| overall | float | Final-stage success rate; the headline U-axis number. |
| n_episodes | int | Total episodes in the U-axis aggregation. |
| stage_order | list[string] | Stage names in graph-declared order. |
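
The relationships between these fields can be checked mechanically. The sketch below is a hypothetical validator (check_understanding is not an existing function) that assumes only what the table states: every rate lies in [0, 1], and overall equals the success rate of the last stage in stage_order.

```python
def check_understanding(u, tol=1e-9):
    """Return a list of human-readable problems found in an 'understanding' block."""
    problems = []
    stages = u["stage_order"]
    rates = u["per_stage_success"]

    # Every declared stage should have a success rate.
    for name in stages:
        if name not in rates:
            problems.append(f"missing rate for stage {name!r}")

    # Success rates are fractions in [0, 1].
    for name, rate in rates.items():
        if not 0.0 <= rate <= 1.0:
            problems.append(f"rate for {name!r} out of range: {rate}")

    # overall is documented as the final-stage success rate.
    if stages and stages[-1] in rates:
        if abs(u["overall"] - rates[stages[-1]]) > tol:
            problems.append("overall does not match final-stage rate")

    return problems


block = {
    "per_stage_success": {"engage": 0.93, "manipulate": 0.71, "release": 0.62},
    "overall": 0.62,
    "n_episodes": 30,
    "stage_order": ["engage", "manipulate", "release"],
}
print(check_understanding(block))
```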

perception

| Field | Type | Description |
| --- | --- | --- |
| ausc | {axis: float} | Per-axis AUSC (area under the success curve), normalised to [0, 1]. Axes: light, view, rotation. |
| overall | float | Mean of the ausc values. |
| curve_points | {axis: [[x, y], ...]} | Raw success-vs-perturbation curve; x is the normalised perturbation in [0, 1], y is the success rate. |
| sweep_config | object | The standard_dr_sweeps() config used (max perturbation per axis, n_steps, episodes_per_step). |
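
A short sketch of how these fields relate numerically. mean_ausc follows the table directly. trapezoid_area is an assumption: this page does not say which integration rule produces ausc, so the trapezoidal rule is shown only as one plausible way to turn curve_points into an area, and its result may not reproduce the stored ausc value exactly. Both function names are hypothetical.

```python
def mean_ausc(ausc):
    """overall is documented as the mean of the per-axis AUSC values."""
    return sum(ausc.values()) / len(ausc)


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as [[x, y], ...].

    Assumption: trapezoidal integration. With x normalised to [0, 1]
    and y in [0, 1], the result also lies in [0, 1].
    """
    pts = sorted(points)  # order by x
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


ausc = {"light": 0.81, "view": 0.74, "rotation": 0.55}
print(round(mean_ausc(ausc), 2))

# A curve that falls linearly from 1.0 to 0.0 covers half the unit square.
print(trapezoid_area([[0.0, 1.0], [1.0, 0.0]]))
```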

behavior

| Field | Type | Description |
| --- | --- | --- |
| jerk_rms | float | End-effector jerk RMS in m/s³, averaged over successful episodes by default (see aggregation). |
| vel_var | float | Normalised velocity variance (dimensionless). |
| path_length_ratio | float | Actual path length divided by straight-line distance. Always ≥ 1.0. |
| joint_jerk_rms | float (optional) | Jerk RMS computed in joint space; recommended for cross-policy comparisons. |
| frame_rate | float | Eval rollout frame rate in Hz (default 25). Required for cross-paper comparisons. |
| aggregation | string | Which episodes the metrics are averaged over: one of "success_only" (default) or "all_episodes". |
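
To illustrate why frame_rate matters when comparing jerk numbers, here is a sketch that estimates jerk RMS from sampled end-effector positions with a third-order finite difference. This is an assumption for illustration only: the page does not specify how jerk_rms is actually computed, and the jerk_rms function below is hypothetical. With this estimator the result scales with the cube of the frame rate, so the same samples read at a different rate give a very different number.

```python
import math


def jerk_rms(positions, frame_rate):
    """RMS jerk magnitude in m/s^3 from (x, y, z) positions in metres.

    Assumption: uniform sampling at frame_rate Hz and a third-order
    forward difference, (p3 - 3*p2 + 3*p1 - p0) / dt**3.
    """
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    dt = 1.0 / frame_rate
    squared = []
    for p0, p1, p2, p3 in zip(positions, positions[1:], positions[2:], positions[3:]):
        jerk = [(d - 3 * c + 3 * b - a) / dt**3 for a, b, c, d in zip(p0, p1, p2, p3)]
        squared.append(sum(j * j for j in jerk))
    return math.sqrt(sum(squared) / len(squared))


# x(t) = t^3 has constant jerk 6 m/s^3; y and z stay at zero.
samples = [((i / 25.0) ** 3, 0.0, 0.0) for i in range(10)]
print(jerk_rms(samples, frame_rate=25))
# Misreading the same samples as 50 Hz inflates the estimate by 2**3.
print(jerk_rms(samples, frame_rate=50))
```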

meta

| Field | Type | Description |
| --- | --- | --- |
| version | string | Results-schema semver. Bumped on any breaking field change. |
| task_graph | string | Path to the YAML used for this run. |
| policy | string | Policy identifier (pi05, openvla-oft, …). |
| checkpoint | string | Checkpoint path or hash. |
| seed | int | Eval seed. |
| timestamp | string | ISO 8601. |
| commit | string | Git SHA of the MetaFine repo at eval time. |
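
A reader can use meta.version to decide whether it understands a file. The sketch below assumes the usual semver convention, which this page does not spell out: a different major version is breaking, and while the major version is 0 a different minor version is treated as breaking too. parse_semver and is_compatible are hypothetical helpers.

```python
def parse_semver(version):
    """Split 'MAJOR.MINOR.PATCH' into ints, ignoring pre-release/build suffixes."""
    core = version.split("+", 1)[0].split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def is_compatible(file_version, reader_version):
    """True if a reader built for reader_version may parse file_version.

    Assumption: same major is required; for 0.y.z, same minor as well.
    """
    f_major, f_minor, _ = parse_semver(file_version)
    r_major, r_minor, _ = parse_semver(reader_version)
    if f_major != r_major:
        return False
    if f_major == 0:
        return f_minor == r_minor
    return True


print(is_compatible("0.1.0", "0.1.3"))
print(is_compatible("0.2.0", "0.1.0"))
```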

Sample output

{
  "understanding": {
    "per_stage_success": { "engage": 0.93, "manipulate": 0.71, "release": 0.62 },
    "overall": 0.62,
    "n_episodes": 30,
    "stage_order": ["engage", "manipulate", "release"]
  },
  "perception": {
    "ausc": { "light": 0.81, "view": 0.74, "rotation": 0.55 },
    "overall": 0.70,
    "curve_points": {
      "light":    [[0,1.0],[0.2,0.97],[0.4,0.90],[0.6,0.83],[0.8,0.72],[1.0,0.55]],
      "view":     [[0,1.0],[0.2,0.95],[0.4,0.83],[0.6,0.70],[0.8,0.58],[1.0,0.40]],
      "rotation": [[0,1.0],[0.2,0.83],[0.4,0.60],[0.6,0.41],[0.8,0.30],[1.0,0.18]]
    },
    "sweep_config": { "n_steps": 6, "episodes_per_step": 20 }
  },
  "behavior": {
    "jerk_rms": 0.082, "vel_var": 0.011, "path_length_ratio": 1.18,
    "frame_rate": 25, "aggregation": "success_only"
  },
  "meta": {
    "version": "0.1.0",
    "task_graph": "configs/grasp_and_lift_cap.yaml",
    "policy": "pi05",
    "checkpoint": "/nat/demos/pi05/run_2026_05_12/step_50000.pt",
    "seed": 42,
    "timestamp": "2026-05-13T14:30:11Z",
    "commit": "5a6c780"
  }
}