User Guide · Evaluation · Understanding

Understanding

The Understanding axis answers “did the policy know what to do, in the right order?” by surfacing per-stage success rates over a multi-step task graph — so you see exactly where the chain breaks.

Why per-stage?

A binary task-level success rate is a noisy summary of many things at once: did the policy recognise the right target? Did it engage the right part? Did it execute the right motion? Did it release cleanly? Two policies with the same overall success can fail in completely different ways — one drops 30% of trials at engagement; another reaches the last stage 95% of the time and fumbles release.

The Understanding axis exposes that breakdown. By writing your task as a sequence of stages with per-stage success predicates, the eval pipeline tells you exactly which stage's failure rate explains the overall number.

The metric

For each stage s in the task graph, MetaFine computes:

per_stage_success[s] = mean over episodes of (1 if predicate(s) was satisfied at any point during stage s, else 0)

The overall Understanding score is the success of the final stage — by construction, the task as a whole is achieved only if the last predicate holds.

Reading the U axis

The results.json block is small and immediately readable:

{
  "understanding": {
    "per_stage_success": {
      "engage":     0.93,
      "manipulate": 0.71,
      "release":    0.62
    },
    "overall": 0.62
  }
}

The diagnostic reading is mechanical: engage looks healthy (93%); manipulate is where most of the loss sits (71% × 93% ≈ 66% would-be conditional pass-through, but the next stage drops further). release at 62% says about 9 percentage points are lost at the very end.

Configuring stages

Stages are named in the task-graph YAML. The validator enforces that each stage has a target part and (optionally) a success predicate. If a stage omits its predicate, it defaults to "skill solver returned without error" — useful for free-style continuation skills that don't have a clean predicate-level definition.

stages:
  - stage_name: engage      # labels the U-axis row
    skill: grasp_part
    target: { object: 100221, part: cap }
    success: grasped("cap")

  - stage_name: manipulate
    skill: pure_rotate
    params: { angle_deg: 90 }
    success: and(grasped("cap"), joint_value("cap", ">=", 1.40))

  - stage_name: release
    skill: release_gripper
    success: not(grasped("cap"))

Caveats

  • Stage windows. A stage's predicate is checked while that stage is the active stage. If a later stage's failure undoes an earlier predicate, the earlier stage still counts as a success.
  • Predicate granularity. Over-tight tolerances inflate failure rates; the defaults are chosen to match what a human evaluator would call "this worked".
  • Sample size. With 30 episodes, per-stage rates have ±10pp standard error. Read the trend across stages, not the absolute numbers.