Introduction
MetaFine is a diagnostic evaluation framework for fine-grained robotic manipulation. Rather than collapsing manipulation into a single binary success rate, MetaFine disentangles policy capability into three fundamental dimensions — understanding, perception, and behavior — and surfaces the hidden failure modes that conventional benchmarks miss.
What is MetaFine?
MetaFine is an open platform purpose-built for fine-grained manipulation research. It connects part-level interaction modeling, scalable data generation, policy training, and standardised evaluation into one end-to-end workflow — designed for both fast iteration in early research and rigorous benchmarking at publication.
The platform is built on a compositional task graph and an extensible asset library so it can generate diverse fine-grained tasks, absorb heterogeneous benchmarks under one schema, and support both pure simulation and hybrid real–sim evaluation.
The three diagnostic dimensions
Conventional benchmarks ask one question: did the policy succeed? A yes/no answer hides which part of the system failed. MetaFine's premise is that any meaningful evaluation answers three questions at once:
- Understanding — Did the policy know what to do, in the right order? Measured by per-stage success rates over a multi-step task graph; surfaces where the chain breaks (engagement → manipulation → release).
- Perception — Did the policy correctly process its sensory inputs under variation? Measured by domain-randomisation sweeps with AUSC (area-under-success-curve) for lighting, camera pose, and camera rotation — a normalised 0-to-1 score per axis.
- Behavior — Did the policy execute its plan smoothly? Measured by action-trajectory smoothness (jerk RMS, velocity variance, path length) — exposes jerky, hesitant, or chunk-of-N-artefact policies that happen to "succeed".
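The three measurements above can be sketched in a few lines of Python. This is a minimal illustration, not MetaFine's implementation: the function names, the input shapes, the trapezoidal integration, and the choice to normalise AUSC by the swept range are all assumptions made here for clarity.

```python
from math import sqrt


def stage_success_rates(episodes, stages):
    """Fraction of episodes that completed each stage.

    `episodes` is a list of sets; each set holds the stage names one
    episode completed (hypothetical input shape).
    """
    if not episodes:
        raise ValueError("need at least one episode")
    return {s: sum(s in ep for ep in episodes) / len(episodes) for s in stages}


def ausc(levels, success_rates):
    """Trapezoidal area under the success curve, divided by the swept range.

    With success rates in [0, 1] the result also lies in [0, 1].
    """
    if len(levels) != len(success_rates) or len(levels) < 2:
        raise ValueError("need two or more matching (level, rate) points")
    area = 0.0
    for i in range(1, len(levels)):
        width = levels[i] - levels[i - 1]
        area += 0.5 * width * (success_rates[i] + success_rates[i - 1])
    return area / (levels[-1] - levels[0])


def jerk_rms(positions, dt):
    """RMS of the third finite difference of a 1-D trajectory sampled every dt."""
    if len(positions) < 4:
        raise ValueError("need at least four samples to estimate jerk")
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i])
        / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return sqrt(sum(j * j for j in jerks) / len(jerks))


# Two episodes: one completes every stage, one stalls after engagement.
episodes = [{"engagement", "manipulation", "release"}, {"engagement"}]
rates = stage_success_rates(episodes, ["engagement", "manipulation", "release"])

# Success falls linearly as the perturbation level grows from 0 to 1.
lighting_ausc = ausc([0.0, 0.5, 1.0], [1.0, 0.5, 0.0])

# A constant-velocity trajectory has zero jerk.
smooth = jerk_rms([0.0, 1.0, 2.0, 3.0, 4.0], dt=1.0)
```

A policy whose success rate stays flat across the sweep scores close to its unperturbed success rate, while one that collapses at the first perturbation scores near zero, which is why a per-axis area is more informative than a single perturbed success rate.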
Every evaluation run emits a single results.json carrying all three, so two policies can be compared apples-to-apples across the full diagnostic plane, not just on a headline number.
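As a rough illustration of comparing two runs, the sketch below flattens two result profiles and reports the per-metric difference. The schema of results.json is not given here, so every key in the example (`understanding`, `stage_success`, `perception`, `ausc`, `behavior`, `jerk_rms`) is a hypothetical placeholder, as are the helper names and the numbers.

```python
import json


def flatten(node, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} for numeric leaves only."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[path] = float(value)
    return flat


def profile_diff(a, b):
    """Return b - a for every metric that appears in both profiles."""
    fa, fb = flatten(a), flatten(b)
    return {key: fb[key] - fa[key] for key in sorted(fa.keys() & fb.keys())}


# Hypothetical profiles; in practice these would be read from each run's
# results.json with json.load().
policy_a = json.loads("""
{"understanding": {"stage_success": {"engagement": 1.0, "release": 0.5}},
 "perception": {"ausc": {"lighting": 0.75}},
 "behavior": {"jerk_rms": 2.0}}
""")
policy_b = json.loads("""
{"understanding": {"stage_success": {"engagement": 1.0, "release": 0.5}},
 "perception": {"ausc": {"lighting": 0.5}},
 "behavior": {"jerk_rms": 4.0}}
""")

diff = profile_diff(policy_a, policy_b)
for metric, delta in diff.items():
    print(f"{metric}: {delta:+.2f}")
```

In this made-up example the two policies have identical per-stage success, yet the second is less robust to lighting and noticeably jerkier, which is the kind of gap a headline success rate would hide.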
The pipeline at a glance
Record → merge → replay → convert → train → evaluate. The condensed commands are below; the Quickstart walks each step.
    # 1. Record expert demos: single skill, or --task-graph for a multi-stage env
    $ python record.py -e grasp_part --object-name 100221 --part-name cap -n 5 --only-count-success

    # 2. Merge shards → 3. Replay (render observations)
    $ python utils/merge_trajectory.py -i demos/grasp_part/100221 -o demos/grasp_part/100221/trajectory.h5 -p trajectory.h5
    $ python utils/replay_trajectory.py --traj-path demos/grasp_part/100221/trajectory.h5 -o rgb -c pd_joint_delta_pos -b physx_cpu --use-first-env-state --save-traj --save-video

    # 4. Convert for training (LeRobot, or convert_to_rlds for OpenVLA)
    $ python utils/convert_to_lerobot.py --traj-path .../trajectory.rgb.pd_joint_delta_pos.physx_cpu.h5 --output-dir lerobot_grasp_part --task-name "Grasp the cap." --fps 30 --robot-type panda

    # 5. Train (LeRobot / StarVLA) → 6. Evaluate the checkpoint closed-loop
    $ python core/policies/pi05/evaluate.py --policy-path /path/to/model --env-id grasp_part --object-name 100221 --part-name cap --n-episodes 50 --task "Grasp the cap."
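When the same recipe has to be repeated for many objects or parts, the record, merge, and replay commands can be assembled programmatically. The sketch below is a dry run that only builds and prints the commands. The script names and flags are copied from the commands above, while the helper names, the `demos/<env>/<object>` layout, and the replayed-file naming pattern are assumptions inferred from that single example.

```python
import shlex


def replayed_name(obs_mode, control_mode, backend, stem="trajectory"):
    """File name the replay step appears to produce, judging by the example."""
    return f"{stem}.{obs_mode}.{control_mode}.{backend}.h5"


def build_commands(env_id, object_name, part_name, n_demos):
    """Build the record, merge, and replay commands as argument lists."""
    demo_dir = f"demos/{env_id}/{object_name}"  # layout assumed from the example
    merged = f"{demo_dir}/trajectory.h5"
    return [
        ["python", "record.py", "-e", env_id,
         "--object-name", object_name, "--part-name", part_name,
         "-n", str(n_demos), "--only-count-success"],
        ["python", "utils/merge_trajectory.py",
         "-i", demo_dir, "-o", merged, "-p", "trajectory.h5"],
        ["python", "utils/replay_trajectory.py", "--traj-path", merged,
         "-o", "rgb", "-c", "pd_joint_delta_pos", "-b", "physx_cpu",
         "--use-first-env-state", "--save-traj", "--save-video"],
    ]


commands = build_commands("grasp_part", "100221", "cap", 5)
for cmd in commands:
    # Dry run: print each command instead of executing it.
    print(shlex.join(cmd))
```

To execute the steps rather than print them, each argument list could be passed to `subprocess.run(cmd, check=True)` so that a failed step stops the pipeline before the next one starts.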
Beyond simulation: real-sim hybrid evaluation
Pure simulation is fast but never quite real; pure real-world evaluation is faithful but doesn't scale. MetaFine bridges the two with PPI (Phone-Pipeline-Import) — a four-step path that turns any real-world object into a part-aware MetaFine asset, so the exact same diagnostic protocol runs on both sides.
- [...] capabilities.json.
- [...] assets/<your-object>/. The skill registry auto-matches applicable skills via the affordance lookup.
- [...] results.json against the public leaderboard.
Why MetaFine?
Two policies with the same headline success rate can have totally different results.json profiles. MetaFine is designed to make that difference visible, on a single comparable plane. Three things make this practical at scale:
- Composable skills. 21 affordance-typed atomic skills (grasp, rotate, slide, insert, …) compose into multi-step task graphs via YAML or Python. Adding a long-horizon task is a 30-line YAML, not a new env class.
- Affordance-aware asset library. 40+ part-annotated articulated objects, each declaring its part-level affordances in capabilities.json. The skill ↔ asset compatibility check is a closed-set lookup, not a heuristic.
- VLA-ready. A shared data pipeline (record → merge → replay → convert) feeds LeRobot and RLDS exports. Seven backbones are vendored (ACT / DP3 / OpenVLA / OpenVLA-OFT / π0 / π0.5 / StarVLA); training is verified via the LeRobot and StarVLA paths, and π0.5 closed-loop inference is verified.
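To make the closed-set lookup concrete, here is a minimal sketch of how a skill ↔ asset compatibility check could work. The data layout is an assumption: the real capabilities.json schema and skill registry are not shown here, and the mappings and skill names below (other than `grasp_part`) are hypothetical placeholders.

```python
# Hypothetical stand-in for one object's capabilities.json:
# part name -> set of affordances that part declares.
CAPABILITIES = {
    "cap": {"grasp", "rotate"},
    "body": {"grasp"},
}

# Hypothetical skill registry: skill name -> affordance it requires.
SKILL_REQUIRES = {
    "grasp_part": "grasp",
    "rotate_part": "rotate",
    "slide_part": "slide",
    "insert_part": "insert",
}


def applicable_skills(capabilities, skill_requires):
    """For each part, list the skills whose required affordance it declares."""
    return {
        part: sorted(
            skill for skill, needed in skill_requires.items() if needed in affordances
        )
        for part, affordances in capabilities.items()
    }


def unsupported_stages(stages, capabilities, skill_requires):
    """Return the (skill, part) stages of a task graph that the asset cannot support."""
    bad = []
    for skill, part in stages:
        needed = skill_requires.get(skill)  # None for an unknown skill
        if needed not in capabilities.get(part, set()):
            bad.append((skill, part))
    return bad


matches = applicable_skills(CAPABILITIES, SKILL_REQUIRES)

# A three-stage task; the slide stage is invalid because "cap" declares no slide affordance.
task = [("grasp_part", "cap"), ("rotate_part", "cap"), ("slide_part", "cap")]
problems = unsupported_stages(task, CAPABILITIES, SKILL_REQUIRES)
```

Because the check is plain set membership over declared affordances, an invalid skill and part pairing is rejected before any simulation runs, rather than being discovered as a failed episode.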
How this guide is organised
The user guide is grouped into four major tracks. Start with Getting Started if you're new; skip ahead by topic if you're integrating a specific piece.
- Getting Started. pip install -e ., dataset download, and your first recorded demo.