Introduction

MetaFine is a diagnostic evaluation framework for fine-grained robotic manipulation. Rather than collapsing manipulation into a single binary success rate, MetaFine disentangles policy capability into three fundamental dimensions — understanding, perception, and behavior — and surfaces the hidden failure modes that conventional benchmarks miss.

Understanding

Per-stage success surfaces where the chain breaks.

Perception

Domain-randomisation (DR) sweeps + AUSC normalise robustness to lighting, view, and jitter.

Behavior

Trajectory smoothness exposes jerky, hesitant, chunk-artefact policies.

What is MetaFine?

MetaFine is an open platform purpose-built for fine-grained manipulation research. It connects part-level interaction modeling, scalable data generation, policy training, and standardised evaluation into one end-to-end workflow — designed for both fast iteration in early research and rigorous benchmarking at publication.

The platform is built on a compositional task graph and an extensible asset library so it can generate diverse fine-grained tasks, absorb heterogeneous benchmarks under one schema, and support both pure simulation and hybrid real–sim evaluation.

The three diagnostic dimensions

Conventional benchmarks ask one question: did the policy succeed? A yes/no answer hides which part of the system failed. MetaFine's premise is that any meaningful evaluation answers three questions at once:

Understanding — Did the policy know what to do, in the right order? Measured by per-stage success rates over a multi-step task graph; surfaces where the chain breaks (engagement → manipulation → release).
Perception — Did the policy correctly process its sensory inputs under variation? Measured by domain-randomisation sweeps with AUSC (area-under-success-curve) for lighting, camera pose, and camera rotation — a normalised 0-to-1 score per axis.
Behavior — Did the policy execute its plan smoothly? Measured by action-trajectory smoothness (jerk RMS, velocity variance, path length) — exposes jerky, hesitant, or chunk-of-N-artefact policies that happen to "succeed".
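The three measurements above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the input layout, the trapezoidal integration, and the choice to normalise AUSC by the swept range are not taken from MetaFine's code, which may compute these metrics differently.

```python
from math import sqrt


def stage_success_rates(episodes, stages):
    """Fraction of episodes that completed each stage.

    `episodes` is a list of sets; each set holds the stage names one
    episode completed (a hypothetical layout, chosen for illustration).
    """
    n = len(episodes)
    return {s: sum(s in ep for ep in episodes) / n for s in stages}


def ausc(levels, success_rates):
    """Trapezoidal area under the success curve, divided by the swept range.

    With success rates in [0, 1] the result is also in [0, 1]; a policy
    that holds 100% success across the whole sweep scores 1.0.
    """
    area = 0.0
    for i in range(1, len(levels)):
        width = levels[i] - levels[i - 1]
        area += 0.5 * (success_rates[i] + success_rates[i - 1]) * width
    return area / (levels[-1] - levels[0])


def jerk_rms(positions, dt):
    """RMS of the third finite difference of a 1-D trajectory sampled every dt seconds."""
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]) / dt**3
        for i in range(len(positions) - 3)
    ]
    return sqrt(sum(j * j for j in jerks) / len(jerks))


# Illustrative inputs only; none of these numbers come from a real run.
stages = ["engagement", "manipulation", "release"]
episodes = [
    {"engagement", "manipulation", "release"},
    {"engagement", "manipulation"},
    {"engagement"},
    {"engagement", "manipulation", "release"},
]
rates = stage_success_rates(episodes, stages)  # chain breaks most often at release
lighting_score = ausc([0.0, 0.5, 1.0], [1.0, 0.8, 0.2])
smooth = jerk_rms([0, 1, 2, 3, 4], dt=1.0)  # constant velocity, zero jerk
```

Reading the per-stage rates left to right shows where success drops, the AUSC value summarises a whole sweep in one number, and a jerk RMS near zero indicates a smooth trajectory.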

Every evaluation run emits a single results.json carrying all three, so two policies can be compared apples-to-apples across the full diagnostic plane, not just on a headline number.
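As a rough sketch of such a comparison, the snippet below flattens two result documents and reports the per-metric difference. The key names (`success_rate`, `understanding`, `perception`, `behavior`, and everything nested under them) are hypothetical; the actual results.json schema is not described here, so adapt the keys to what your run emits.

```python
import json


def flatten(d, prefix=""):
    """Turn nested dicts into {"a.b.c": value}, keeping numeric leaves only."""
    out = {}
    for key, value in d.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            out[name] = float(value)
    return out


def diff_results(a, b):
    """Per-metric difference (b minus a) over the metrics both documents share."""
    fa, fb = flatten(a), flatten(b)
    return {k: fb[k] - fa[k] for k in sorted(fa.keys() & fb.keys())}


# Hypothetical results.json contents for two policies (made-up numbers).
policy_a = json.loads("""{
  "success_rate": 0.8,
  "understanding": {"stage_success": {"engagement": 0.95, "manipulation": 0.85, "release": 0.8}},
  "perception": {"ausc": {"lighting": 0.9, "camera_pose": 0.7}},
  "behavior": {"jerk_rms": 1.2}
}""")
policy_b = json.loads("""{
  "success_rate": 0.8,
  "understanding": {"stage_success": {"engagement": 0.9, "manipulation": 0.85, "release": 0.8}},
  "perception": {"ausc": {"lighting": 0.5, "camera_pose": 0.7}},
  "behavior": {"jerk_rms": 3.4}
}""")

delta = diff_results(policy_a, policy_b)
for metric, change in delta.items():
    print(f"{metric:45s} {change:+.2f}")
```

In this made-up example the headline success rate is identical, while the lighting AUSC and jerk RMS differ sharply, which is the kind of gap a single number hides.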

The pipeline at a glance

Record → merge → replay → convert → train → evaluate. The condensed commands are below; the Quickstart walks through each step. First, a sample of the tasks MetaFine ships with.

grasp · cap

grasp · body

insert

peg-in-hole

toggle-switch

rotate-along

stack-cube

open-box

long-horizon

extensible: compose your own from atomic skills + part-aware assets

# 1. Record expert demos — single skill, or --task-graph for a multi-stage env
$ python record.py -e grasp_part --object-name 100221 --part-name cap -n 5 --only-count-success

# 2. Merge shards → 3. Replay (render observations)
$ python utils/merge_trajectory.py -i demos/grasp_part/100221 -o demos/grasp_part/100221/trajectory.h5 -p trajectory.h5
$ python utils/replay_trajectory.py --traj-path demos/grasp_part/100221/trajectory.h5 -o rgb -c pd_joint_delta_pos -b physx_cpu --use-first-env-state --save-traj --save-video

# 4. Convert for training (LeRobot, or convert_to_rlds for OpenVLA)
$ python utils/convert_to_lerobot.py --traj-path .../trajectory.rgb.pd_joint_delta_pos.physx_cpu.h5 --output-dir lerobot_grasp_part --task-name "Grasp the cap." --fps 30 --robot-type panda

# 5. Train (LeRobot / StarVLA)  →  6. Evaluate the checkpoint closed-loop
$ python core/policies/pi05/evaluate.py --policy-path /path/to/model --env-id grasp_part --object-name 100221 --part-name cap --n-episodes 50 --task "Grasp the cap."
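One detail in the commands above is easy to miss: the file passed to the convert step carries the observation mode, control mode, and backend chosen at replay time in its name. The helper below reproduces that naming so a script can chain steps 3 and 4. The pattern is inferred only from the two paths shown above, and `replayed_path` is a hypothetical helper, not part of MetaFine.

```python
from pathlib import Path


def replayed_path(traj_path, obs_mode, control_mode, backend):
    """Guess the replay output name: <stem>.<obs>.<control>.<backend><suffix>.

    Assumption: the replay script writes next to the input file, as the
    convert command in the pipeline listing suggests.
    """
    p = Path(traj_path)
    return p.with_name(f"{p.stem}.{obs_mode}.{control_mode}.{backend}{p.suffix}")


merged = "demos/grasp_part/100221/trajectory.h5"
replayed = replayed_path(merged, "rgb", "pd_joint_delta_pos", "physx_cpu")
print(replayed.as_posix())
```

If the replay flags change, for example a different `-o` or `-c` value, the path handed to the converter has to change with them.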

Beyond simulation: real-sim hybrid evaluation

Pure simulation is fast but never quite real; pure real-world evaluation is faithful but doesn't scale. MetaFine bridges the two with PPI (Phone-Pipeline-Import) — a four-step path that turns any real-world object into a part-aware MetaFine asset, so the exact same diagnostic protocol runs on both sides.

Step 01

Phone scan

Capture the object with any LiDAR-equipped phone: point cloud, mesh, and RGB with pose, all in one sweep.

Step 02

Process

Reconstruct geometry, segment parts, derive affordances — yields a URDF + capabilities.json.

Step 03

Import

Drop into assets/<your-object>/. The skill registry auto-matches applicable skills via the affordance lookup.

Step 04

Reproduce

Run the same diagnostic eval — Understanding / Perception / Behavior — paired against your real-world rollouts.
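The auto-matching in Step 03 can be pictured as a plain dictionary lookup. The sketch below assumes a capabilities.json that maps each part to a list of affordance names and a registry that maps each skill to the one affordance it requires. Both layouts, and the affordance names themselves, are hypothetical stand-ins for whatever MetaFine actually uses.

```python
import json

# Hypothetical capabilities.json content for a scanned bottle.
CAPABILITIES = json.loads("""{
  "object": "bottle",
  "parts": {
    "cap": ["graspable", "rotatable"],
    "body": ["graspable"]
  }
}""")

# Hypothetical skill registry: skill name -> affordance the target part must declare.
SKILL_REQUIREMENTS = {
    "grasp": "graspable",
    "rotate": "rotatable",
    "insert": "insertable",
}


def applicable_skills(capabilities, requirements):
    """For every part, list the skills whose required affordance the part declares."""
    matches = {}
    for part, affordances in capabilities["parts"].items():
        matches[part] = sorted(
            skill for skill, needed in requirements.items() if needed in affordances
        )
    return matches


matched = applicable_skills(CAPABILITIES, SKILL_REQUIREMENTS)
print(matched)
```

Because the match is a membership test against a fixed vocabulary, a skill either applies to a part or it does not; there is no scoring or threshold to tune.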

Coming soon: unified evaluation platform

Scan, upload, get diagnosed.

A hosted MetaFine instance for real + sim policies. Submit a phone scan and your policy checkpoint; we run the four-step PPI pipeline + diagnostic eval and return a full results.json against the public leaderboard.

Why MetaFine?

Two policies with the same headline success rate can have very different results.json profiles. MetaFine is designed to make that difference visible on a single comparable plane. Three things make this practical at scale:

Composable skills. 21 affordance-typed atomic skills (grasp, rotate, slide, insert, …) compose into multi-step task graphs via YAML or Python. Adding a long-horizon task is a 30-line YAML, not a new env class.
Affordance-aware asset library. 40+ part-annotated articulated objects, each declaring its part-level affordances in capabilities.json. The skill ↔ asset compatibility check is a closed-set lookup, not a heuristic.
VLA-ready. A shared data pipeline (record → merge → replay → convert) feeds LeRobot and RLDS exports. Seven backbones are vendored (ACT / DP3 / OpenVLA / OpenVLA-OFT / π0 / π0.5 / StarVLA); training is verified via the LeRobot and StarVLA paths, and π0.5 closed-loop inference is verified.
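To make the first point concrete, here is a minimal sketch of composing atomic skills into a linear multi-step task in Python. `Stage`, `run_graph`, and the `execute` callback are hypothetical names invented for this illustration; MetaFine's own task-graph API and YAML schema may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Stage:
    """One step of a task: apply an atomic skill to a named part (hypothetical type)."""
    name: str
    skill: str
    part: str


def run_graph(stages: Sequence[Stage], execute: Callable[[Stage], bool]) -> Dict[str, bool]:
    """Run stages in order and stop at the first failure.

    Returns the outcome of every attempted stage, so the caller can see
    exactly where the chain broke.
    """
    outcome: Dict[str, bool] = {}
    for stage in stages:
        ok = execute(stage)
        outcome[stage.name] = ok
        if not ok:
            break
    return outcome


# A three-stage task composed from atomic skills (illustrative only).
open_bottle = [
    Stage("engagement", skill="grasp", part="cap"),
    Stage("manipulation", skill="rotate", part="cap"),
    Stage("release", skill="grasp", part="body"),
]


def fake_executor(stage: Stage) -> bool:
    """Stand-in for a real policy rollout: pretend every rotate attempt fails."""
    return stage.skill != "rotate"


result = run_graph(open_bottle, fake_executor)
print(result)
```

Recording the outcome per stage rather than per episode is what lets the Understanding dimension report where a long-horizon task fails instead of only whether it failed.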

How this guide is organised

The user guide is grouped into four major tracks. Start with Getting Started if you're new; skip ahead by topic if you're integrating a specific piece.

01 · Getting Started

Install & quickstart

Conda env, pip install -e ., dataset download, your first recorded demo.

02 · Core Concepts

Skills, affordances, task graphs

The 21-skill / 11-affordance / predicate-DSL trio that lets you compose new tasks.

03 · Evaluation

The three-dimensional protocol

Stage success, DR-AUSC, smoothness — what each metric measures and how to read it.

04 · Policies

Run a VLA on MetaFine

ACT, DP3, OpenVLA(-OFT), π0/π0.5, StarVLA — install, train, evaluate with one CLI.