Perception

The Perception axis answers “did the policy correctly process its inputs under variation?” It runs domain-randomisation (DR) sweeps along three axes (lighting, viewpoint, and rotation) and summarises each one as a normalised AUSC (area-under-success-curve) score from 0 to 1.

Why DR sweeps?

A policy can score 100% under the lighting, camera position, and rotation of its training distribution, yet collapse the moment a window opens or a camera shifts. A single headline success rate hides that brittleness. DR sweeps explicitly probe variation along axes that real deployments will encounter, then compress each resulting curve into one comparable number per axis.

The three DR axes

Axis | What varies | Range
light | Intensity, colour temperature, and direction of the scene lights. | 0.2× to 2.0× nominal intensity; ±30° azimuth; ±1500 K CCT.
view | Camera translation around the nominal viewpoint. | ±15 cm in each of x / y / z.
rotation | Camera rotation about the look-at point. | ±20° azimuth; ±10° elevation; ±5° roll.

Each axis is swept with a fixed number of perturbation levels (default: 6 steps from 0 to the maximum). For each level, the same task graph runs N episodes; the success rate is recorded as a point on the axis's curve.
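The sweep procedure described above can be sketched as a simple loop. This is a minimal illustration, not the code in utils/eval_sweep.py: perturbation_levels, sweep_axis, and toy_episode are hypothetical names, and the toy episode is a random stand-in for a real rollout of the task graph.

```python
import random
from typing import Callable, List, Tuple


def perturbation_levels(steps: int = 6) -> List[float]:
    """Evenly spaced normalised levels from 0 (nominal) to 1 (configured maximum)."""
    if steps < 2:
        raise ValueError("need at least two levels to form a curve")
    return [i / (steps - 1) for i in range(steps)]


def sweep_axis(
    run_episode: Callable[[float], bool],
    steps: int = 6,
    episodes_per_step: int = 20,
) -> List[Tuple[float, float]]:
    """Return [(level, success_rate), ...] for one DR axis."""
    curve = []
    for level in perturbation_levels(steps):
        # Run the same task N times at this perturbation level.
        successes = sum(run_episode(level) for _ in range(episodes_per_step))
        curve.append((level, successes / episodes_per_step))
    return curve


# Hypothetical stand-in for a real rollout: success becomes less
# likely as the perturbation level grows.
rng = random.Random(0)


def toy_episode(level: float) -> bool:
    return rng.random() < 1.0 - 0.5 * level


curve = sweep_axis(toy_episode)
for level, rate in curve:
    # 0.15 m is the configured maximum of the view axis.
    print(f"level={level:.1f}  view offset=±{level * 0.15 * 100:.0f} cm  success={rate:.2f}")
```

With the default of 6 steps the levels are 0, 0.2, 0.4, 0.6, 0.8, and 1.0, so level 0.6 on the view axis corresponds to a ±9 cm offset.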

AUSC normalisation

The raw output of one axis is a curve of success rate vs. perturbation magnitude. AUSC compresses this curve to a scalar:

AUSC = ∫₀¹ success(p) dp ∈ [0, 1], where p is the normalised perturbation

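The perturbation axis is normalised so that 0 is no perturbation and 1 is the configured maximum. An AUSC of 1.0 means the policy succeeds every time over the entire configured range; 0.5 is what a linear decline from 100% success at no perturbation to 0% at the maximum would give (other curve shapes can produce the same value); and a value near 0 means the policy fails almost as soon as any perturbation is applied. Because both axes of the curve run from 0 to 1, AUSC scores are comparable across DR axes and across policies.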
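To make the integral concrete, here is a minimal sketch that approximates AUSC from sampled curve points with the trapezoid rule. The integration rule is an assumption: this page does not say which rule the evaluator uses, so values computed this way may differ slightly from reported scores. The function name ausc is hypothetical.

```python
from typing import Sequence, Tuple


def ausc(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoid-rule area under a success curve on a normalised [0, 1] axis.

    `points` holds (normalised_perturbation, success_rate) pairs.
    """
    pts = sorted(points)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    if pts[0][0] != 0.0 or pts[-1][0] != 1.0:
        raise ValueError("curve must span the full normalised range [0, 1]")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


fully_robust = [(0.0, 1.0), (1.0, 1.0)]    # success never drops
linear_falloff = [(0.0, 1.0), (1.0, 0.0)]  # success declines linearly to zero

print(ausc(fully_robust))    # 1.0
print(ausc(linear_falloff))  # 0.5
```

On a finite grid, a policy that fails at the first non-zero level still gets a small positive score under this rule, because the trapezoid between level 0 and the first step has non-zero area.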

Reading the P axis

{
  "perception": {
    "ausc": {
      "light":    0.81,
      "view":     0.74,
      "rotation": 0.55
    },
    "overall": 0.70,
    "curve_points": {
      "light":    [[0, 1.0], [0.2, 0.97], [0.4, 0.90], [0.6, 0.83], [0.8, 0.72], [1.0, 0.55]],
      "view":     [[0, 1.0], [0.2, 0.95], [0.4, 0.83], [0.6, 0.70], [0.8, 0.58], [1.0, 0.40]],
      "rotation": [[0, 1.0], [0.2, 0.83], [0.4, 0.60], [0.6, 0.41], [0.8, 0.30], [1.0, 0.18]]
    }
  }
}

The overall Perception score is the mean of the three AUSC values; curve_points preserves the full curve so you can re-plot it without re-running.
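As a quick sanity check when consuming a report, the overall score can be recomputed from the per-axis values. The sketch below assumes the report is available as a JSON string with the field names shown above; how the report is actually written to disk is not specified here.

```python
import json
from statistics import mean

# Trimmed copy of the example report (curve_points omitted for brevity).
report_json = """
{
  "perception": {
    "ausc": {"light": 0.81, "view": 0.74, "rotation": 0.55},
    "overall": 0.70
  }
}
"""

report = json.loads(report_json)["perception"]
axis_scores = report["ausc"]

recomputed = mean(axis_scores.values())        # mean of the three AUSC values
weakest = min(axis_scores, key=axis_scores.get)  # axis with the lowest AUSC

print(f"overall={recomputed:.2f} weakest={weakest}")
# overall=0.70 weakest=rotation
```

Looking at the weakest axis alongside the mean is useful because an average of three numbers can mask one poor axis.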

Configuring sweeps

The default sweep configuration ships in utils/eval_sweep.py via standard_dr_sweeps(). Override at the call site:

from utils.eval_sweep import dr_sweep, standard_dr_sweeps

cfgs = standard_dr_sweeps()
cfgs["view"].steps = 10           # finer-grained sweep on view
cfgs["view"].max_translation = 0.20  # widen to ±20 cm

result = dr_sweep(env, graph, policy, episodes_per_step=20, configs=cfgs)
print(result.ausc)
# {'light': 0.81, 'view': 0.71, 'rotation': 0.55}

The number of episodes per step is the main cost knob, since total rollouts scale linearly with it. 20 is the default; raise it to 50 for lower-variance final numbers.
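The trade-off can be estimated before running anything. The sketch below assumes every axis uses the same number of steps and treats each episode as an independent Bernoulli trial, so the standard error of a measured success rate p over n episodes is sqrt(p(1 - p) / n). Both helper names are hypothetical.

```python
import math


def total_episodes(axes: int = 3, steps: int = 6, episodes_per_step: int = 20) -> int:
    """Rollouts needed for a full sweep, assuming the same step count on every axis."""
    return axes * steps * episodes_per_step


def success_rate_stderr(p: float, n: int) -> float:
    """Binomial standard error of a success rate p estimated from n episodes."""
    return math.sqrt(p * (1.0 - p) / n)


for n in (20, 50):
    # p = 0.5 is the worst case for binomial variance.
    print(n, total_episodes(episodes_per_step=n), round(success_rate_stderr(0.5, n), 3))
# 20 360 0.112
# 50 900 0.071
```

Under these assumptions, going from 20 to 50 episodes per step multiplies the cost by 2.5 while shrinking the worst-case per-point standard error from roughly 11 to roughly 7 percentage points.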

Caveats

  • AUSC ≠ worst case. A policy with a high AUSC can still be fragile if its curve has a steep cliff near the end of the range. Inspect curve_points directly when robustness is the headline claim.
  • Axes interact. Real deployments perturb several axes at once. The product of three healthy single-axis AUSCs overestimates joint robustness; treat single-axis numbers as necessary, not sufficient.
  • Training overlap. If your training data already includes the DR ranges, you're measuring fit, not generalisation. Report which augmentations are in-distribution.
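For the first caveat, a small helper can surface the lowest point and the steepest drop in a curve so that a cliff is not hidden behind a healthy AUSC. This is an illustrative sketch; curve_diagnostics is a hypothetical name, and the input is the light curve from the example report.

```python
from typing import Dict, Sequence, Tuple


def curve_diagnostics(points: Sequence[Tuple[float, float]]) -> Dict[str, object]:
    """Report the lowest success rate and the largest drop between adjacent levels."""
    pts = sorted(points)
    worst_level, worst_rate = min(pts, key=lambda p: p[1])
    # (drop, from_level, to_level) for each pair of neighbouring points.
    drops = [(y0 - y1, x0, x1) for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    biggest_drop, start, end = max(drops)
    return {
        "worst_rate": worst_rate,
        "worst_level": worst_level,
        "biggest_drop": biggest_drop,
        "drop_interval": (start, end),
    }


light = [(0, 1.0), (0.2, 0.97), (0.4, 0.90), (0.6, 0.83), (0.8, 0.72), (1.0, 0.55)]
diag = curve_diagnostics(light)
print(diag["worst_rate"], diag["drop_interval"], round(diag["biggest_drop"], 2))
# 0.55 (0.8, 1.0) 0.17
```

Here the light axis loses 17 points of success rate over the last fifth of the range, which the scalar score alone does not reveal.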