Policy Evaluation#

This page defines the FGManip evaluation protocol for fine-grained manipulation policies.

Evaluation Metrics#

Metrics are grouped into three dimensions: Behavior, Understanding, and Perception.

Behavior#

Success Rate: fraction of successful episodes over 100 evaluation trials.
Trajectory Smoothness (Stability): action smoothness metric computed as:

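Stability = exp(-(1 / (T - 1)) * sum_{t=2}^{T} ||a_t - a_{t-1}||_2)

Here a_t is the action at step t and T is the number of actions in the episode, so the sum covers the T - 1 consecutive action differences. Stability lies in (0, 1]; values closer to 1 indicate smoother trajectories.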
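The two Behavior metrics can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names `success_rate` and `stability` are hypothetical, and the sketch assumes actions are indexed a_1 through a_T, so that the 1 / (T - 1) factor averages over exactly T - 1 consecutive differences.

```python
import math
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful episodes, in [0, 1]."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def stability(actions: Sequence[Sequence[float]]) -> float:
    """exp(-mean L2 norm of consecutive action differences).

    Assumes `actions` holds the T action vectors of one episode,
    giving T - 1 consecutive differences.
    """
    if len(actions) < 2:
        raise ValueError("need at least two actions")
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        # L2 norm of the step-to-step change in the action vector
        total += math.sqrt(sum((c - p) ** 2 for c, p in zip(curr, prev)))
    return math.exp(-total / (len(actions) - 1))
```

A constant action sequence yields a stability of exactly 1, and larger step-to-step changes push the value toward 0.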

Understanding#

Instruction Perturbation Success Rate: success rate after perturbing or scrambling task instructions.
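One possible form of instruction scrambling is a seeded shuffle of the instruction's words. The sketch below is an assumption for illustration only: the protocol does not specify the perturbation procedure here, and `scramble_instruction` is a hypothetical helper name.

```python
import random


def scramble_instruction(instruction: str, seed: int = 0) -> str:
    """Return the instruction with its words in a shuffled order.

    Hypothetical perturbation: the same words are kept, only their
    order changes. A fixed seed makes the result reproducible.
    """
    words = instruction.split()
    rng = random.Random(seed)  # local RNG, does not touch global state
    rng.shuffle(words)
    return " ".join(words)
```

The policy would then be evaluated on the scrambled instructions, and the success rate computed in the same way as for the unperturbed task.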

Perception#

Generalization of Success AUSC across Lighting Perturbations: success rate at each of three levels of lighting perturbation, and the corresponding AUSC.

Generalization of Success AUSC across Camera View Perturbations: success rate at each of three levels of camera-view perturbation, and the corresponding AUSC.
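As a rough sketch of how an AUSC value could be derived from per-level success rates, the code below integrates success rate over perturbation level with the trapezoidal rule and normalizes by the level range. This reading of AUSC, the function name `ausc`, and the numeric level values are all assumptions; the page does not define the exact computation.

```python
from typing import Sequence


def ausc(levels: Sequence[float], success_rates: Sequence[float]) -> float:
    """Normalized area under the success-rate-vs-level curve.

    Assumed definition: trapezoidal integration over strictly
    increasing perturbation levels, divided by the level range so
    that a constant success rate s gives an AUSC of s.
    """
    if len(levels) != len(success_rates) or len(levels) < 2:
        raise ValueError("need matching sequences with at least two levels")
    area = 0.0
    for i in range(1, len(levels)):
        width = levels[i] - levels[i - 1]
        if width <= 0:
            raise ValueError("levels must be strictly increasing")
        # Trapezoid between two adjacent perturbation levels
        area += width * (success_rates[i] + success_rates[i - 1]) / 2
    return area / (levels[-1] - levels[0])
```

For example, hypothetical success rates of 0.8, 0.6, and 0.4 at levels 1, 2, and 3 give an AUSC of about 0.6 under this definition.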

Leaderboard (Coming Soon)#

We will publish a benchmark leaderboard with standardized splits, evaluation seeds, and result submission format.