MetaFine
META-EVALUATION FRAMEWORK · FINE-GRAINED MANIPULATION · 2026

Diagnosing Fine-Grained Manipulation Beyond Binary Success.

A diagnostic compass, not just a benchmark.

MetaFine is a diagnostic evaluation framework for fine-grained robotic manipulation. Rather than collapsing manipulation into a single binary success rate, MetaFine disentangles policy capability into three fundamental dimensions — understanding, perception, and behavior — and reveals the hidden failure modes that conventional benchmarks miss. Built on a compositional task graph and an extensible asset library, MetaFine can generate diverse fine-grained tasks, absorb heterogeneous benchmarks, and support both simulation-based diagnosis and hybrid real–sim evaluation. By turning evaluation from a leaderboard into a tool for scientific diagnosis, MetaFine provides the infrastructure needed to measure, understand, and ultimately improve genuine physical dexterity.

architecture · framework overview

From atomic primitives to diagnostic findings

01 · INPUTS
Atomic skills (composable primitives): grasp · align · insert · rotate · slide · press · toggle + extensible
Asset library (part-aware assets): 431 objects · 1,078 parts · 4,312 grasps + annotation tools

02 · ENGINE
Compositional task graph: auto-composes tasks of arbitrary complexity and absorbs external benchmarks (RoboTwin · CALVIN · LIBERO)

03 · 3-AXIS DIAGNOSIS
Understanding: semantic intervention
Perception: view + light perturbation
Behavior: stage success + smoothness

04 · OUTPUTS
Diagnostic findings: policy capabilities and failure modes for fine-grained manipulation
Hybrid evaluation: few real + many sim → calibrated hybrid estimate

Features.

3 dimensions
Three-dimensional diagnosis
Evaluate understanding, perception, and behavior separately to expose hidden failure modes.

composable
Atomic compositional skills and tasks
Compose arbitrary fine-grained tasks from reusable atomic manipulation skills.

fine-grained assets
Fine-grained assets
Build on fine-grained object annotations, with rapid annotation tools for scalable asset expansion.

real + sim
Hybrid real–sim evaluation
Bridge simulation and reality with hybrid evaluation under limited hardware budgets.

easy to use
Drop-in install
Get started with a single command, pip install metafine; no infrastructure setup is required.

built-in agent
Agent + skill library
An embedded agent with reusable skills automates most simulation and evaluation tasks end-to-end.
Arbitrarily extensible: compose any fine-grained (FG) task from atomic skills.
01 · Overview

Binary success is not enough.

MetaFine is a diagnostic framework for fine-grained robotic manipulation. It moves beyond binary success rates to reveal hidden failures in understanding, perception, and behavior, enabling unified evaluation across benchmarks, simulation, and the real world.

three axes

Three-dimensional diagnosis

Disentangle manipulation into understanding, perception, and behavior instead of collapsing everything into binary success.

ecosystem

Extensible ecosystem

Compose fine-grained tasks from atomic skills, draw on a part-aware asset library (431 objects, 1,078 annotated parts, 4,312 grasp poses), and absorb heterogeneous external benchmarks under a single diagnostic formalism.

specific findings

Pinpoint architectural bottlenecks

Surface dimension-specific failure modes — semantic grounding gaps, visual encoder ceilings, and action-head trade-offs — translating aggregate scores into actionable design insights.

real + sim

Hybrid real–sim evaluation

Combine scalable simulation with limited real-world rollouts for more stable performance estimation under scarce hardware budgets.

02 · Evaluation Protocol

A three-dimensional diagnostic framework.

MetaFine decomposes robotic manipulation into three complementary axes — enabling researchers to identify not only whether a policy fails, but exactly where and why it fails.

01 / UNDERSTANDING

Comprehend, or replay?

Probe whether models understand fine-grained semantic instructions, such as selecting a different part of the same object, rather than replaying memorized routines.

Example intervention (same bottle, different target part):
original: "grasp the cap" (the gripper grasps the cap of the bottle)
intervene: "grasp the body" (the gripper grasps the body of the bottle)
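A part-level intervention like the one above can be sketched in a few lines of Python. This is only an illustration: the names `InstructionPair`, `make_part_interventions`, and `classify_response` are hypothetical and are not part of any MetaFine API described on this page. The idea is to keep the object fixed, swap the target part in the instruction, and then label whether the policy followed the new instruction or replayed the original routine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionPair:
    """One original instruction and one part-level intervention (hypothetical type)."""
    original: str
    intervened: str
    original_part: str
    intervened_part: str


def make_part_interventions(template, parts, original_part):
    """Build one intervened instruction per alternative part of the same object."""
    if original_part not in parts:
        raise ValueError(f"unknown part: {original_part!r}")
    original = template.format(part=original_part)
    return [
        InstructionPair(original, template.format(part=p), original_part, p)
        for p in parts
        if p != original_part
    ]


def classify_response(pair, acted_on_part):
    """Label a rollout that was run under the intervened instruction."""
    if acted_on_part == pair.intervened_part:
        return "followed"   # the policy tracked the new instruction
    if acted_on_part == pair.original_part:
        return "replayed"   # the policy repeated the memorized routine
    return "other"          # the policy acted on neither part


pairs = make_part_interventions("grasp the {part}", ["cap", "body"], "cap")
print(pairs[0].intervened)
print(classify_response(pairs[0], acted_on_part="cap"))
```

Separating "replayed" from "other" is a design choice of this sketch: it distinguishes a policy that ignores the language change from one that simply fails.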
02 / PERCEPTION

Precise perception for fine-grained manipulation.

Probe whether policies preserve precise part-level perception at close range, under both geometric perturbations in viewpoint and pose and photometric perturbations in lighting and color.

Example conditions:
original: original viewpoint and lighting
view: geometric perturbation (viewpoint change)
light: photometric perturbation (lighting change)
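The two perturbation families can be illustrated with a small standard-library sketch. It assumes a camera that orbits a look-at target about the vertical axis for the geometric case, and a global brightness gain plus per-channel color gains for the photometric case. The page does not specify these parameterizations, and `orbit_camera` and `relight` are hypothetical helper names.

```python
import math


def orbit_camera(position, target, yaw_deg):
    """Geometric perturbation: rotate the camera position about the
    vertical (z) axis passing through `target`, keeping its height."""
    yaw = math.radians(yaw_deg)
    dx, dy = position[0] - target[0], position[1] - target[1]
    c, s = math.cos(yaw), math.sin(yaw)
    return (target[0] + c * dx - s * dy,
            target[1] + s * dx + c * dy,
            position[2])


def relight(pixel_rows, gain=1.0, rgb_gains=(1.0, 1.0, 1.0)):
    """Photometric perturbation: scale brightness and per-channel color
    of an 8-bit RGB image given as rows of (r, g, b) tuples."""
    def clamp(value):
        return max(0, min(255, int(round(value))))

    return [
        [tuple(clamp(ch * gain * g) for ch, g in zip(px, rgb_gains)) for px in row]
        for row in pixel_rows
    ]


# Shift the viewpoint by 15 degrees and brighten a tiny one-pixel image by 50%.
print(orbit_camera((1.0, 0.0, 0.5), (0.0, 0.0, 0.0), 15.0))
print(relight([[(100, 200, 50)]], gain=1.5))
```

Keeping the two perturbations as separate functions lets each one be applied alone, which matches the idea of testing geometric and photometric robustness independently.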
03 / BEHAVIOR

Fine-grained behavior through stage-wise success and smoothness.

Behavior is measured through stage-wise success and trajectory smoothness, revealing whether policies can both progress through long-horizon tasks and execute controlled motions.
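Both behavior measures can be sketched under stated assumptions. Here stage-wise success counts a stage only if every earlier stage in the same episode also succeeded, and smoothness is approximated by the mean squared jerk of the end-effector trajectory (third finite difference of position). The page does not define the exact smoothness metric, so mean squared jerk is a stand-in, and the function names are hypothetical.

```python
def stage_success_rates(episodes, stages):
    """Fraction of episodes that reach and complete each stage in order.

    `episodes` is a list of per-episode boolean lists aligned with `stages`.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    counts = [0] * len(stages)
    for flags in episodes:
        for i, ok in enumerate(flags):
            if not ok:
                break  # later stages do not count once a stage fails
            counts[i] += 1
    return {stage: count / len(episodes) for stage, count in zip(stages, counts)}


def mean_squared_jerk(positions, dt):
    """Mean squared norm of the third finite difference of position.

    `positions` is a list of equal-length coordinate tuples sampled every
    `dt` seconds; lower values indicate smoother motion.
    """
    if len(positions) < 4:
        raise ValueError("need at least four samples")
    total = 0.0
    for t in range(len(positions) - 3):
        p0, p1, p2, p3 = positions[t:t + 4]
        jerk = [(d - 3 * c + 3 * b - a) / dt ** 3
                for a, b, c, d in zip(p0, p1, p2, p3)]
        total += sum(j * j for j in jerk)
    return total / (len(positions) - 3)


episodes = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, False],
]
print(stage_success_rates(episodes, ["grasp", "align", "insert"]))
print(mean_squared_jerk([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)], dt=1.0))
```

Reporting the per-stage rates rather than only the final one shows where along a long-horizon task the policy stops making progress.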

Atomic skills: grasp · align · insert · rotate · slide · press · toggle (+ extensible)

Benchmark · 7 VLAs × 6 tasks

Figure: per-task success rate across state-of-the-art VLAs (vision-language-action models), shown per evaluation condition; the best result per task is highlighted in red.
03 · Composable Manipulation

Composable fine-grained manipulation.

MetaFine builds tasks bottom-up from two reusable building blocks — atomic skill primitives and a part-aware asset library — enabling structured task generation, benchmark absorption, and scalable diagnostic coverage.

A · Atomic Fine-Grained Skills

Tasks composed from atomic primitives.

Each task is a graph: nodes are atomic skills (grasp-part, align, insert, rotate-along, slide, press-part, toggle-part) and edges encode dependencies. Paths through the graph define tasks of arbitrary complexity, enabling structured absorption of external benchmarks under a single formalism.
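The graph formalism described above can be sketched with a plain adjacency mapping. The skill names come from this page, but the specific edges below are illustrative assumptions, and `enumerate_tasks` is a hypothetical helper rather than a MetaFine function. Each path starting at a chosen skill is treated as one task.

```python
# Atomic skills named on this page.
SKILLS = {
    "grasp-part", "align", "insert", "rotate-along",
    "slide", "press-part", "toggle-part",
}

# Hypothetical dependency edges: "a -> b" means skill b may follow skill a.
EDGES = {
    "grasp-part": ["align"],
    "align": ["insert", "press-part"],
    "insert": ["rotate-along"],
    "rotate-along": ["slide"],
    "press-part": ["toggle-part"],
    "slide": [],
    "toggle-part": [],
}


def enumerate_tasks(edges, start, max_len=None):
    """Yield every simple path from `start` as a tuple of skills."""
    def walk(path):
        yield tuple(path)
        if max_len is not None and len(path) >= max_len:
            return
        for nxt in edges.get(path[-1], ()):
            if nxt not in path:  # avoid revisiting a skill within one task
                yield from walk(path + [nxt])

    if start not in edges:
        raise ValueError(f"unknown skill: {start!r}")
    yield from walk([start])


# Every edge endpoint must be a known atomic skill.
assert all(n in SKILLS for src, dst in EDGES.items() for n in [src, *dst])

for task in enumerate_tasks(EDGES, "grasp-part"):
    print(" -> ".join(task))
```

Under this representation, absorbing an external benchmark task would amount to expressing it as a path over the same skill vocabulary, extending `SKILLS` and `EDGES` where a new primitive is needed.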

Atomic skills: grasp-part · align · insert · rotate-along · slide · press-part · toggle-part (+ extensible)

Diagram: two example skill sequences
1. GRASP (acquire part) · ALIGN (position precisely) · INSERT (insert into slot) · ROTATE (rotate in place) · SLIDE (slide along axis)
2. GRASP (acquire part) · ALIGN (position precisely) · PRESS (press down) · TOGGLE (toggle state)
Legend: grasp · align · manipulation
B · Extensible Asset Library

Scalable, part-aware assets.

MetaFine ships with a part-annotated object library and rapid annotation tooling. Assets plug directly into the task graph, and external benchmarks are absorbed under the same compositional formalism for unified diagnostic comparison.

MetaFine asset library (parts, grasp poses, scenes): 431 objects · 1,078 annotated parts · 4,312 grasp poses
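One way to picture a part-annotated asset is as an object that owns named parts, each carrying its own grasp poses. The sketch below is a guess at such a structure for illustration only: the class names, fields, and units are assumptions, and the page does not describe MetaFine's actual asset format.

```python
from dataclasses import dataclass, field


@dataclass
class GraspPose:
    """A grasp expressed in the part frame (assumed convention)."""
    position: tuple      # (x, y, z), metres assumed
    quaternion: tuple    # (w, x, y, z)


@dataclass
class Part:
    name: str
    grasps: list = field(default_factory=list)


@dataclass
class Asset:
    name: str
    parts: dict = field(default_factory=dict)

    def add_part(self, part):
        """Register a part, rejecting duplicate part names."""
        if part.name in self.parts:
            raise ValueError(f"duplicate part: {part.name!r}")
        self.parts[part.name] = part


def library_stats(assets):
    """Count objects, annotated parts, and grasp poses in a library."""
    return {
        "objects": len(assets),
        "parts": sum(len(a.parts) for a in assets),
        "grasps": sum(len(p.grasps) for a in assets for p in a.parts.values()),
    }


identity = (1.0, 0.0, 0.0, 0.0)
bottle = Asset("bottle")
bottle.add_part(Part("cap", [GraspPose((0.0, 0.0, 0.02), identity),
                             GraspPose((0.0, 0.0, 0.03), identity)]))
bottle.add_part(Part("body", [GraspPose((0.0, 0.0, -0.05), identity)]))
print(library_stats([bottle]))
```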
04 · Key Findings

What MetaFine reveals.

We cannot build what we cannot measure.

Finding 01

Binary success inflates capability.

Policies that appear strong under binary benchmarks often fail once part-level semantic and physical constraints are enforced. Under disentangled evaluation, agents scoring 85% on conventional benchmarks can drop to 40%, revealing that binary success often conflates lucky completion with genuine fine-grained skill.

Finding 02 · Understanding

Training paradigm matters more than scale.

Semantic grounding is governed more by how a model is trained than by how large it is. End-to-end policies often conflate language with visual priors, whereas modular architectures better preserve fine-grained instruction-following under attribute-level semantic changes.

Finding 03 · Perception

Visual resolution sets a hard precision ceiling.

Fine-grained manipulation depends on whether the encoder can resolve precise part-level spatial structure. When spatial fidelity is insufficient, downstream action generation cannot recover the missing information, making visual resolution an absolute ceiling on manipulation precision.

Finding 04 · Perception

Geometric and photometric robustness fail differently.

Robustness is not a single property. Geometric robustness is tied primarily to encoder spatial representations, while photometric robustness depends more strongly on modality-level invariance and pretraining diversity, exposing distinct failure sources under visual change.

Finding 05 · Behavior

Behavior trades off stability and expressiveness.

Deterministic action generation produces stable but rigid trajectories, while stochastic generation offers more diverse corrections but is more vulnerable to accumulated spatial drift. Fine-grained execution therefore depends not only on control smoothness, but also on how different action paradigms propagate perceptual uncertainty.

Finding 06

Hybrid real–sim evaluation stabilizes estimation.

Small-batch hardware evaluation is inherently noisy and can misrepresent true policy performance. By calibrating large-scale simulation with a limited set of paired real rollouts, hybrid real–sim evaluation produces estimates that are substantially more stable and more faithful to real-world capability.

05 · Hybrid Real–Sim

Scalable real-world validation.

MetaFine integrates 3D Gaussian Splatting (3DGS)-based real-to-sim transfer with prediction-powered inference (PPI). A small number of paired real-world rollouts calibrates large-scale simulation, enabling stable, reproducible estimates of true policy performance under limited hardware budgets.

01 · Real physical capture
Real-world physical setup.
↓ real2sim · mobile device
02 · Sim 3DGS reconstruction
3DGS-reconstructed simulation scene.
↓ policy rollout
03 · Rollout policy execution
04 · PPI calibration

Limited real trials calibrate large-scale sim estimates.
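The calibration step can be sketched with the standard prediction-powered mean estimator: the average simulated outcome over many unpaired sim rollouts, plus the average real-minus-sim gap measured on the paired rollouts. This is a minimal sketch under that assumption. The page does not say which PPI variant MetaFine uses, and the function and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean(real_paired, sim_paired, sim_unpaired, alpha=0.05):
    """Prediction-powered estimate of the real success rate.

    real_paired / sim_paired: outcomes (0 or 1) for the same n tasks run
    on hardware and in simulation. sim_unpaired: outcomes from N further
    sim-only rollouts. Returns (estimate, (lower, upper)) with a
    normal-approximation confidence interval at level 1 - alpha.
    """
    if len(real_paired) != len(sim_paired):
        raise ValueError("paired lists must have equal length")
    n, big_n = len(real_paired), len(sim_unpaired)
    if n < 2 or big_n < 2:
        raise ValueError("need at least two paired and two unpaired rollouts")

    # Rectifier: how far simulation is off from reality on the paired set.
    rectifier = [y - f for y, f in zip(real_paired, sim_paired)]
    estimate = fmean(sim_unpaired) + fmean(rectifier)

    # Uncertainty combines the sim average and the rectifier average.
    std_err = sqrt(variance(sim_unpaired) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


real = [1, 0, 1, 1]          # few real rollouts
sim_same = [1, 1, 1, 1]      # sim outcomes on the same four tasks
sim_many = [1] * 8 + [0] * 2 # many sim-only rollouts
print(ppi_mean(real, sim_same, sim_many))
```

The estimate is not clipped to [0, 1] here; with very few paired rollouts the interval can extend beyond that range, which is one reason the paired set should not be too small.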

Figure: prediction-powered inference (PPI) calibration plot, comparing conditions.
06 · Cite

BibTeX

@article{xu2026metafine,
  title   = {Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation},
  author  = {Xu, He-Yang and Zhang, Pengyuan and Ge, Zongyuan and Hao, Xiaoshuai and Belongie, Serge and Geng, Xin and Peng, Yuxin and Wei, Xiu-Shen},
  journal = {arXiv preprint arXiv:2605.19986},
  year    = {2026}
}