Policy Evaluation#
This page defines the FGManip evaluation protocol for fine-grained manipulation policies.
Evaluation Metrics#
Metrics are grouped into three dimensions: Behavior, Understanding, and Perception.
Behavior#
Success Rate: fraction of successful episodes over 100 evaluation trials.
Trajectory Smooth (Stability): action smoothness metric computed as:
Stability = exp(-(1 / (T - 1)) * sum_{t=1}^{T} ||a_t - a_{t-1}||_2)
Understanding#
Instruction Perturbation Success Rate: success rate after perturbing or scrambling task instructions.
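As a rough illustration of what scrambling an instruction could look like, the sketch below shuffles the words of an instruction with a seeded random generator. The function name `scramble_instruction`, the word-level shuffle, and the `seed` parameter are all assumptions made for this example; the page does not specify how FGManip perturbs instructions.

```python
import random


def scramble_instruction(instruction: str, seed: int = 0) -> str:
    """Return the instruction with its words shuffled.

    Hypothetical perturbation: word-level shuffling is only one possible
    reading of "scrambling" and is not taken from the FGManip protocol.
    """
    words = instruction.split()
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    rng.shuffle(words)
    return " ".join(words)


if __name__ == "__main__":
    print(scramble_instruction("pick up the red block and place it in the bowl"))
```

The policy would then be evaluated on the perturbed instructions, and the success rate computed as usual.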
Perception#
Generalization of Success AUSC across Lighting Perturbations: success rate at three levels of lighting perturbation, together with the corresponding AUSC.
Generalization of Success AUSC across Camera View Perturbations: success rate at three levels of camera-view perturbation, together with the corresponding AUSC.
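The Behavior and Perception metrics above could be computed roughly as in the following sketch. It rests on several assumptions that are not stated on this page:

- The function names `success_rate`, `stability`, and `ausc` are hypothetical.
- An episode is taken to hold T action vectors, giving T - 1 consecutive differences to match the 1 / (T - 1) normalizer.
- AUSC is taken to be the trapezoidal area under the success-rate curve across perturbation levels, divided by the level range.

```python
import math
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful episodes, e.g. over 100 evaluation trials."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def stability(actions: Sequence[Sequence[float]]) -> float:
    """exp(-mean L2 norm of consecutive action differences).

    Assumption: `actions` holds T action vectors, so there are T - 1
    differences, matching the 1 / (T - 1) factor in the formula.
    """
    if len(actions) < 2:
        raise ValueError("need at least two actions")
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    return math.exp(-total / (len(actions) - 1))


def ausc(levels: Sequence[float], rates: Sequence[float]) -> float:
    """Trapezoidal area under success rate vs. perturbation level.

    Assumption: the area is divided by the level range so the result stays
    in [0, 1]; the page does not define AUSC or its normalization.
    """
    if len(levels) != len(rates) or len(levels) < 2:
        raise ValueError("need matching levels and rates, at least two each")
    area = 0.0
    for i in range(1, len(levels)):
        width = levels[i] - levels[i - 1]
        if width <= 0:
            raise ValueError("levels must be strictly increasing")
        area += 0.5 * width * (rates[i] + rates[i - 1])
    return area / (levels[-1] - levels[0])
```

Under these assumptions, a perfectly still trajectory scores a Stability of 1, and the score decays toward 0 as the average step-to-step action change grows. For the perception metrics, `ausc` would be called once per perturbation type with the three perturbation levels and their measured success rates.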
Leaderboard (Coming Soon)#
We will publish a benchmark leaderboard with standardized splits, evaluation seeds, and result submission format.