Real-sim hybrid evaluation (PPI)

PPI (Phone-Pipeline-Import) turns any real-world articulated object into a part-aware MetaFine asset, so the same diagnostic protocol runs both in simulation and on the physical object. A hosted online evaluation platform is coming soon.

Why hybrid?

Pure simulation is fast and reproducible but never quite real; pure real-world evaluation is faithful but doesn't scale to dozens of policies × dozens of objects × dozens of perturbation conditions. The hybrid path puts both sides on the same task graph with the same predicates and the same diagnostic axes — the only thing that changes is whether the rollout happens in SAPIEN or on the bench.

PPI is the on-ramp: it takes a phone scan of a real object and produces a MetaFine asset (URDF + meshes + capabilities.json) that is indistinguishable from a PartNet-Mobility asset, so all the existing skills and task graphs just work.

The four steps

Phone scan. Capture with a LiDAR phone.
Process. Reconstruct, segment, derive affordances.
Import. Drop into the asset library.
Reproduce. Same diagnostic eval, sim + real.

01 · Phone scan

Any LiDAR-equipped phone (iPhone 12 Pro / Pro Max and later Pro models, Pixel 8 Pro+, plus most flagship Android phones with depth sensors) can capture a multi-view scan in 30–60 seconds. The scanner records:

Dense point cloud + textured mesh of the object;
Per-view RGB + intrinsics;
A short manipulation video showing how the moving parts articulate.

The articulation video is the key signal — it tells the next stage which sub-meshes belong to which kinematic part.

02 · Process

The processing pipeline takes the raw scan and emits a clean MetaFine asset:

Reconstruct. Refine the mesh; estimate the rest pose; align to a canonical axis.
Segment parts. Use the articulation video to split the mesh into rigid parts; estimate per-part joint axes and limits from the motion observed.
Derive affordances. Run utils.derive_capabilities over the segmented URDF; it combines each joint's type with a synonym table to propose an initial capabilities.json.
Human QA. A short interactive review (utils.review_capabilities) lets you confirm / toggle / walk through the proposed affordances.
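
The Derive affordances step can be pictured with a small sketch. This is not the implementation of utils.derive_capabilities, whose internals are not described here; it only illustrates how a joint type plus a synonym table could yield a proposal. The synonym table, the affordance names (openable, twistable, slidable), and the output shape are all assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical synonym table: a word found in a link name -> canonical part label.
PART_SYNONYMS = {
    "lid": "cap", "cap": "cap", "top": "cap",
    "door": "door", "flap": "door",
    "drawer": "drawer", "tray": "drawer",
}

# Hypothetical mapping from URDF joint type to candidate affordances.
JOINT_AFFORDANCES = {
    "revolute": ["openable"],
    "continuous": ["twistable"],
    "prismatic": ["slidable"],
}


def propose_capabilities(urdf_text: str) -> dict:
    """Propose per-part affordances from joint types and link names."""
    root = ET.fromstring(urdf_text)
    proposal = {}
    for joint in root.iter("joint"):
        affordances = JOINT_AFFORDANCES.get(joint.get("type"), [])
        if not affordances:
            continue  # fixed joints and unknown types propose nothing
        child = joint.find("child").get("link")
        # Map the link name to a canonical label if any of its words is a synonym.
        label = next(
            (PART_SYNONYMS[w] for w in child.lower().split("_") if w in PART_SYNONYMS),
            child,
        )
        proposal[child] = {"part": label, "affordances": list(affordances)}
    return proposal


EXAMPLE_URDF = """
<robot name="my_jar">
  <link name="body"/>
  <link name="jar_lid"/>
  <joint name="lid_joint" type="continuous">
    <parent link="body"/>
    <child link="jar_lid"/>
  </joint>
</robot>
"""

if __name__ == "__main__":
    print(json.dumps(propose_capabilities(EXAMPLE_URDF), indent=2))
```

A proposal of this kind is only a starting point, which is why the Human QA step follows: a joint type alone cannot tell a screw cap from a freely spinning knob.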

The output is a directory laid out exactly like an existing asset.

03 · Import

Move the processed directory into the asset library:

$ mv ~/scans/my_jar  assets/my_jar/
# The directory contains: urdf.xml  meshes/  capabilities.json  model_data.json
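
Before moving a processed directory, it can help to confirm that the layout is complete. The following is a minimal sketch that assumes the four entries listed in the comment above are the only required ones; the helper name missing_entries is hypothetical and not part of MetaFine.

```python
import sys
from pathlib import Path

# Entries listed for a processed asset directory.
REQUIRED_FILES = ("urdf.xml", "capabilities.json", "model_data.json")
REQUIRED_DIRS = ("meshes",)


def missing_entries(asset_dir) -> list:
    """Return the required files and directories that are absent from asset_dir."""
    root = Path(asset_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    # Usage: python check_layout.py ~/scans/my_jar
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    problems = missing_entries(target)
    print("layout OK" if not problems else "missing: " + ", ".join(problems))
```

The check only looks at presence, not content; a malformed capabilities.json would still pass it.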

No further registration is needed — SKILL_REGISTRY lookups consult capabilities.json at runtime, so the new asset is immediately eligible for every skill whose affordance contract it satisfies.
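
The eligibility rule can be read as a subset check: a skill applies when everything it requires is offered by the asset. The sketch below assumes that capabilities.json exposes a flat list under an "affordances" key and that each skill declares a set of required affordances. Neither the schema nor the skill names are taken from MetaFine, and the dictionary is a stand-in rather than the real SKILL_REGISTRY.

```python
import json

# Stand-in for a skill registry: skill name -> affordances it requires.
# All names here are illustrative only.
EXAMPLE_SKILL_CONTRACTS = {
    "unscrew_cap": {"twistable", "graspable"},
    "open_door": {"openable"},
    "lift": {"graspable"},
}


def eligible_skills(capabilities_json: str, contracts: dict) -> list:
    """Return the skills whose required affordances are all offered by the asset."""
    offered = set(json.loads(capabilities_json).get("affordances", []))
    # `required <= offered` is the set-subset test.
    return sorted(name for name, required in contracts.items() if required <= offered)


if __name__ == "__main__":
    my_jar = json.dumps({"affordances": ["graspable", "twistable"]})
    print(eligible_skills(my_jar, EXAMPLE_SKILL_CONTRACTS))  # ['lift', 'unscrew_cap']
```

Because the check runs against the file at lookup time, editing capabilities.json during Human QA changes which skills match without touching any registry code.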

04 · Reproduce

Run an existing task graph against the new asset; the three diagnostic dimensions are computed exactly as before:

$ python core/policies/pi05/evaluate.py \
    --task-graph configs/grasp_and_lift_cap.yaml \
    --asset-override my_jar \
    --checkpoint /path/to/ckpt \
    --episodes 30

For matched real-world evaluation, replay the same task graph on the physical object using your real-robot stack. Submit both results.json files to the hybrid comparison report to inspect the sim-to-real gap on each diagnostic axis.
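
A per-axis comparison of the two files can be sketched as follows. The results.json schema is not specified in this section, so the code assumes a top-level "diagnostics" mapping from axis name to a numeric score. That key, the axis names, and the numbers in the demo are placeholders, not real results.

```python
import json


def load_results(path: str) -> dict:
    """Read one results.json file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def sim_to_real_gap(sim: dict, real: dict) -> dict:
    """Per-axis gap (sim minus real) for the axes present in both results."""
    sim_axes = sim.get("diagnostics", {})    # assumed key, see lead-in
    real_axes = real.get("diagnostics", {})
    shared = sorted(set(sim_axes) & set(real_axes))
    return {axis: round(sim_axes[axis] - real_axes[axis], 4) for axis in shared}


if __name__ == "__main__":
    # Placeholder numbers, used only to show the shape of the output.
    sim = {"diagnostics": {"axis_a": 0.80, "axis_b": 0.65}}
    real = {"diagnostics": {"axis_a": 0.70, "axis_b": 0.60, "axis_c": 0.50}}
    for axis, gap in sim_to_real_gap(sim, real).items():
        print(f"{axis}: {gap:+.2f}")
```

Axes that appear in only one file are skipped rather than treated as zero, so a missing measurement is not mistaken for a large gap.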

Coming soon: the unified evaluation platform

Scan, upload, get diagnosed — without the local pipeline.

A hosted MetaFine instance will run the full PPI pipeline + diagnostic evaluation in the cloud. Submit a phone scan and your policy checkpoint; receive a complete results.json against the public leaderboard. The platform consolidates real + sim evaluation under one schema, so cross-policy and cross-paper comparisons stay apples-to-apples.

Waitlist

The hosted platform is in private alpha. To be notified when the public submission portal opens, watch the project repo on GitHub or follow the project home page for release announcements.