Every diagnostic eval run emits a single results.json file. This is the field-by-field reference.
Overview
Each eval run writes exactly one results.json. Top-level keys are stable; the nested fields under each diagnostic dimension are versioned by the meta.version field. Treat unknown keys as forward-compatible additions: tolerate them on read and never strip them on round-trip.
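The round-trip rule above can be illustrated with a minimal Python sketch. It assumes the file is edited through plain dicts, which keep every key whether the reader knows it or not. The helper name update_policy and the key future_axis are hypothetical, used only for illustration.

```python
import json

def update_policy(raw: str, policy: str) -> str:
    """Return results.json text with meta.policy replaced and every other key kept."""
    doc = json.loads(raw)  # plain dicts retain unknown keys as-is
    doc.setdefault("meta", {})["policy"] = policy
    return json.dumps(doc, indent=2)

# "future_axis" is a made-up key standing in for a field this reader does not know.
raw = '{"meta": {"version": "0.1.0", "policy": "pi05"}, "future_axis": {"score": 0.5}}'
updated = json.loads(update_policy(raw, "openvla-oft"))
```

Deserialising into a fixed-field class and serialising it back would silently drop future_axis, which is the failure mode the rule guards against.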
Top-level layout
{
  "understanding": { ... },
  "perception": { ... },
  "behavior": { ... },
  "meta": { ... }
}
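A reader may want to confirm that the four top-level keys are present before touching nested fields. The sketch below is one way to do that; the helper name missing_top_level is hypothetical and not part of MetaFine. Extra keys are deliberately not reported, in line with the forward-compatibility rule.

```python
REQUIRED_TOP_LEVEL = ("understanding", "perception", "behavior", "meta")

def missing_top_level(doc: dict) -> list:
    """Return the required top-level keys absent from doc, in layout order."""
    return [key for key in REQUIRED_TOP_LEVEL if key not in doc]
```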
understanding
| Field | Type | Description |
| --- | --- | --- |
| per_stage_success | {stage_name: float} | Stage success rate over all eval episodes. Range [0, 1]. |
| overall | float | Final-stage success rate; the headline U-axis number. |
| n_episodes | int | Total episodes in the U-axis aggregation. |
| stage_order | list[string] | Stage names in graph-declared order. |
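The relationships between these fields can be checked mechanically. The following is a sketch under the assumption that "final stage" means the last entry of stage_order; check_understanding is a hypothetical helper name.

```python
def check_understanding(u: dict, tol: float = 1e-9) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    stages = u["stage_order"]
    rates = u["per_stage_success"]
    if set(stages) != set(rates):
        problems.append("stage_order and per_stage_success name different stages")
    for name, rate in rates.items():
        if not 0.0 <= rate <= 1.0:
            problems.append(f"{name}: success rate {rate} outside [0, 1]")
    # Assumption: the final stage is the last name in stage_order.
    if stages and stages[-1] in rates and abs(u["overall"] - rates[stages[-1]]) > tol:
        problems.append("overall does not equal the final-stage success rate")
    return problems
```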
perception
| Field | Type | Description |
| --- | --- | --- |
| ausc | {axis: float} | Per-axis AUSC (area under the success curve), normalised to [0, 1]. Axes: light, view, rotation. |
| overall | float | Mean of the ausc values. |
| curve_points | {axis: [[x, y], ...]} | Raw success-vs-perturbation curve, with the x axis normalised to [0, 1]. |
| sweep_config | object | The standard_dr_sweeps() config used (max perturbation per axis, n_steps, episodes_per_step). |
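The overall field is the plain mean of the per-axis values, and an area under a curve can be estimated from curve_points. The sketch below uses the trapezoidal rule, which is an assumption: this reference does not state which integration rule produces ausc, so the result may not reproduce the stored values exactly. Both function names are hypothetical.

```python
def mean_ausc(ausc: dict) -> float:
    """Mean of the per-axis AUSC values, as described for perception.overall."""
    return sum(ausc.values()) / len(ausc)

def trapezoid_area(points: list) -> float:
    """Area under a piecewise-linear curve given as [x, y] pairs.

    Assumption: trapezoidal integration; the actual AUSC rule is not specified here.
    """
    pts = sorted(points)  # order by x in case the points arrive unsorted
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```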
behavior
| Field | Type | Description |
| --- | --- | --- |
| jerk_rms | float | End-effector jerk RMS in m/s³, averaged over successful episodes. |
| vel_var | float | Normalised velocity variance (dimensionless). |
| path_length_ratio | float | Actual path length divided by straight-line distance. Always ≥ 1.0. |
| joint_jerk_rms | float (optional) | Jerk RMS computed in joint space; recommended for cross-policy comparisons. |
| frame_rate | float | Eval rollout frame rate in Hz (default 25). Required for cross-paper comparisons. |
| aggregation | string | Episode aggregation mode: "success_only" (default) or "all_episodes". |
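To make the role of frame_rate concrete, here is a rough per-episode sketch of two of these metrics. It assumes end-effector positions sampled once per frame and a third-order finite difference for jerk; the estimator actually used to produce results.json is not specified in this reference, and both function names are hypothetical.

```python
import math

def path_length_ratio(positions: list) -> float:
    """Travelled path length over start-to-end distance (undefined if they coincide)."""
    path = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    straight = math.dist(positions[0], positions[-1])
    return path / straight

def jerk_rms(positions: list, frame_rate: float = 25.0) -> float:
    """RMS jerk magnitude from a third-order finite difference of positions.

    Assumption: uniform sampling at frame_rate Hz, positions in metres.
    """
    if len(positions) < 4:
        raise ValueError("need at least 4 samples for a third difference")
    dt = 1.0 / frame_rate
    squared = []
    for p0, p1, p2, p3 in zip(positions, positions[1:], positions[2:], positions[3:]):
        jerk = [(d - 3 * c + 3 * b - a) / dt ** 3 for a, b, c, d in zip(p0, p1, p2, p3)]
        squared.append(sum(j * j for j in jerk))
    return math.sqrt(sum(squared) / len(squared))
```

Because the finite difference divides by dt cubed, the same trajectory sampled at a different rate yields a different jerk estimate, which is why frame_rate has to accompany the number when comparing across sources.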
meta
| Field | Type | Description |
| --- | --- | --- |
| version | string | Results-schema semver. Bumped on any breaking field change. |
| task_graph | string | Path to the task-graph YAML used for this run. |
| policy | string | Policy identifier (pi05, openvla-oft, …). |
| checkpoint | string | Checkpoint path or hash. |
| seed | int | Eval seed. |
| timestamp | string | Eval timestamp in ISO 8601 format. |
| commit | string | Git SHA of the MetaFine repo at eval time. |
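A reader can gate on meta.version before interpreting nested fields. The sketch below is one possible policy, not a documented one: it assumes plain MAJOR.MINOR.PATCH strings without pre-release suffixes, treats a MAJOR change as breaking, and, while MAJOR is 0, treats a MINOR change as breaking too. The function names are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def is_compatible(file_version: str, reader_version: str) -> bool:
    """True if a reader built for reader_version may interpret file_version.

    Assumption: during 0.x a MINOR bump may be breaking, so MAJOR and MINOR
    must both match; from 1.0.0 onwards only MAJOR must match.
    """
    f, r = parse_version(file_version), parse_version(reader_version)
    if f[0] == 0 or r[0] == 0:
        return f[:2] == r[:2]
    return f[0] == r[0]
```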
Sample output
{
  "understanding": {
    "per_stage_success": { "engage": 0.93, "manipulate": 0.71, "release": 0.62 },
    "overall": 0.62,
    "n_episodes": 30,
    "stage_order": ["engage", "manipulate", "release"]
  },
  "perception": {
    "ausc": { "light": 0.81, "view": 0.74, "rotation": 0.55 },
    "overall": 0.70,
    "curve_points": {
      "light": [[0,1.0],[0.2,0.97],[0.4,0.90],[0.6,0.83],[0.8,0.72],[1.0,0.55]],
      "view": [[0,1.0],[0.2,0.95],[0.4,0.83],[0.6,0.70],[0.8,0.58],[1.0,0.40]],
      "rotation": [[0,1.0],[0.2,0.83],[0.4,0.60],[0.6,0.41],[0.8,0.30],[1.0,0.18]]
    },
    "sweep_config": { "n_steps": 6, "episodes_per_step": 20 }
  },
  "behavior": {
    "jerk_rms": 0.082, "vel_var": 0.011, "path_length_ratio": 1.18,
    "frame_rate": 25, "aggregation": "success_only"
  },
  "meta": {
    "version": "0.1.0",
    "task_graph": "configs/grasp_and_lift_cap.yaml",
    "policy": "pi05",
    "checkpoint": "/nat/demos/pi05/run_2026_05_12/step_50000.pt",
    "seed": 42,
    "timestamp": "2026-05-13T14:30:11Z",
    "commit": "5a6c780"
  }
}