META-EVALUATION FRAMEWORK · FINE-GRAINED MANIPULATION · 2026

Diagnosing Fine-Grained Manipulation Beyond Binary Success.

A diagnostic compass, not just a benchmark.

MetaFine is a diagnostic evaluation framework for fine-grained robotic manipulation. Rather than collapsing manipulation into a single binary success rate, MetaFine disentangles policy capability into three fundamental dimensions — understanding, perception, and behavior — and reveals the hidden failure modes that conventional benchmarks miss. Built on a compositional task graph and an extensible asset library, MetaFine can generate diverse fine-grained tasks, absorb heterogeneous benchmarks, and support both simulation-based diagnosis and hybrid real–sim evaluation. By turning evaluation from a leaderboard into a tool for scientific diagnosis, MetaFine provides the infrastructure needed to measure, understand, and ultimately improve genuine physical dexterity.

Paper: arXiv 2605.19986 · Code · ModelScope · Hugging Face

Framework overview: from atomic primitives to diagnostic findings (how MetaFine fits together).

Features.

3 dimensions

Three-dimensional diagnosis

Evaluate understanding, perception, and behavior separately to expose hidden failure modes.

composable

Atomic compositional skills and tasks

Compose arbitrary fine-grained tasks from reusable atomic manipulation skills.

fine-grained assets

Fine-grained assets

Start from part-level object annotations and use rapid annotation tools to expand the asset library at scale.

real + sim

Hybrid real–sim evaluation

Bridge simulation and reality with hybrid evaluation under limited hardware budgets.

easy to use

Drop-in install

Get started with a single pip install metafine — no infrastructure setup required.

built-in agent

Agent + skill library

An embedded agent with reusable skills automates most simulation and evaluation tasks end-to-end.

Arbitrarily extensible: compose any fine-grained (FG) task from atomic skills.

01 · Overview

Binary success is not enough.

MetaFine is a diagnostic framework for fine-grained robotic manipulation. It moves beyond binary success rates to reveal hidden failures in understanding, perception, and behavior, enabling unified evaluation across benchmarks, simulation, and the real world.

three axes

Three-dimensional diagnosis

Disentangle manipulation into understanding, perception, and behavior instead of collapsing everything into binary success.

ecosystem

Extensible ecosystem

Compose fine-grained tasks from atomic skills, scale to thousands of part-aware assets, and absorb heterogeneous external benchmarks under a single diagnostic formalism.

specific findings

Pinpoint architectural bottlenecks

Surface dimension-specific failure modes — semantic grounding gaps, visual encoder ceilings, and action-head trade-offs — translating aggregate scores into actionable design insights.

real + sim

Hybrid real–sim evaluation

Combine scalable simulation with limited real-world rollouts for more stable performance estimation under scarce hardware budgets.

02 · Evaluation Protocol

A three-dimensional diagnostic framework.

MetaFine decomposes robotic manipulation into three complementary axes — enabling researchers to identify not only whether a policy fails, but exactly where and why it fails.

01 / UNDERSTANDING

Comprehend, or replay?

Probe whether models understand fine-grained semantic instructions, such as selecting a different part of the same object, rather than replaying memorized routines.

Original: "grasp the cap" (robot gripper grasping the cap of a bottle)
Intervened: "grasp the body" (robot gripper grasping the body of a bottle)
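
One way to quantify the comprehend-versus-replay distinction is to log which part the gripper actually contacts after the instruction is switched. The sketch below is illustrative only: the `Episode` record, its field names, and the three outcome labels are assumptions made for this example, not MetaFine's API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical log of one intervened rollout (field names are assumptions)."""
    original_part: str    # part named in the original instruction, e.g. "cap"
    instructed_part: str  # part named after the intervention, e.g. "body"
    contacted_part: str   # part the gripper actually grasped


def classify(ep: Episode) -> str:
    """Label an intervened episode as followed, replayed, or other."""
    if ep.contacted_part == ep.instructed_part:
        return "followed"   # policy obeyed the new instruction
    if ep.contacted_part == ep.original_part:
        return "replayed"   # policy repeated the original routine
    return "other"          # grasped a different part or missed


def intervention_rates(episodes: list[Episode]) -> dict[str, float]:
    """Fraction of episodes falling into each outcome class."""
    if not episodes:
        raise ValueError("need at least one episode")
    counts = Counter(classify(ep) for ep in episodes)
    return {k: counts.get(k, 0) / len(episodes) for k in ("followed", "replayed", "other")}


# Synthetic episodes for illustration only.
episodes = [
    Episode("cap", "body", "body"),
    Episode("cap", "body", "cap"),
    Episode("cap", "body", "cap"),
    Episode("cap", "body", "handle"),
]
rates = intervention_rates(episodes)
print(rates)
```

A high "replayed" share under this kind of bookkeeping would indicate that the policy is ignoring the changed instruction rather than failing at the grasp itself.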

02 / PERCEPTION

Precise perception for fine-grained manipulation.

Probe whether policies preserve precise part-level perception at close range under two kinds of visual change: geometric perturbations (viewpoint and pose) and photometric perturbations (lighting and color).

Conditions shown: original viewpoint and lighting; geometric perturbation (viewpoint change); photometric perturbation (lighting change).
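
To make the geometric/photometric split concrete, the following sketch reports the average drop in success rate per perturbation family relative to the unperturbed baseline. The condition names (`view`, `pose`, `light`, `color`), the `FAMILIES` grouping, and the outcome lists are assumptions for illustration; the page does not specify how MetaFine aggregates these scores.

```python
from statistics import fmean

# Hypothetical grouping of perturbation conditions into the two families
# named in the text; the condition names themselves are assumptions.
FAMILIES = {
    "geometric": ("view", "pose"),
    "photometric": ("light", "color"),
}


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of successful rollouts."""
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(outcomes) / len(outcomes)


def family_drops(baseline: list[bool], perturbed: dict[str, list[bool]]) -> dict[str, float]:
    """Mean absolute drop in success rate for each perturbation family."""
    base = success_rate(baseline)
    report = {}
    for family, conditions in FAMILIES.items():
        drops = [base - success_rate(perturbed[c]) for c in conditions if c in perturbed]
        if drops:
            report[family] = fmean(drops)
    return report


# Synthetic outcomes for illustration only (True = success).
baseline = [True] * 8 + [False] * 2
perturbed = {
    "view": [True] * 4 + [False] * 6,
    "pose": [True] * 5 + [False] * 5,
    "light": [True] * 7 + [False] * 3,
    "color": [True] * 6 + [False] * 4,
}
drops = family_drops(baseline, perturbed)
print(drops)
```

Reporting the two families separately keeps a policy that is fragile to viewpoint but stable under lighting from being averaged into a single robustness number.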

03 / BEHAVIOR

Fine-grained behavior through stage-wise success and smoothness.

Behavior is measured through stage-wise success and trajectory smoothness, revealing whether policies can both progress through long-horizon tasks and execute controlled motions.

Skills: grasp, align, insert, rotate, slide, press, toggle (extensible)
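
The two behavior measures can be sketched as follows. Stage-wise success is computed as the fraction of episodes that completed at least each stage. For smoothness, the page does not name a specific metric, so this example assumes mean squared jerk (the third finite difference of position) on a 1-D trajectory; the function names and data are hypothetical.

```python
def stage_success(stages_completed: list[int], n_stages: int) -> list[float]:
    """Fraction of episodes that completed at least stage k, for k = 1..n_stages."""
    if not stages_completed:
        raise ValueError("need at least one episode")
    n = len(stages_completed)
    return [sum(c >= k for c in stages_completed) / n for k in range(1, n_stages + 1)]


def mean_squared_jerk(positions: list[float], dt: float) -> float:
    """Mean squared third finite difference of a 1-D trajectory sampled every dt seconds.

    Assumed smoothness proxy: lower values mean smoother motion.
    """
    if len(positions) < 4:
        raise ValueError("need at least four samples")
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i]) / dt**3
        for i in range(len(positions) - 3)
    ]
    return sum(j * j for j in jerks) / len(jerks)


# Synthetic data: number of stages each of five episodes completed (out of 3).
per_stage = stage_success([3, 1, 2, 3, 0], n_stages=3)

# A constant-acceleration trajectory has zero jerk; a cubic one does not.
smooth = mean_squared_jerk([float(t**2) for t in range(6)], dt=1.0)
jerky = mean_squared_jerk([float(t**3) for t in range(6)], dt=1.0)
print(per_stage, smooth, jerky)
```

Stage-wise rates show where in a long-horizon task progress stalls, while the jerk term separates controlled motions from abrupt ones even when both reach the goal.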

benchmark · 7 vlas × 6 tasks

Per-task success rate across state-of-the-art VLAs

success · %

best per task highlighted in red · tabs switch evaluation condition · bars re-animate on each switch

03 · Composable Manipulation

Composable fine-grained manipulation.

MetaFine builds tasks bottom-up from two reusable building blocks — atomic skill primitives and a part-aware asset library — enabling structured task generation, benchmark absorption, and scalable diagnostic coverage.

A · Atomic Fine-Grained Skills

Tasks composed from atomic primitives.

Each task is a graph: nodes are atomic skills (grasp-part, align, insert, rotate-along, slide, press-part, toggle-part) and edges encode dependencies. Paths through the graph define tasks of arbitrary complexity, enabling structured absorption of external benchmarks under a single formalism.
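
A minimal sketch of the graph idea, assuming a plain adjacency-list representation: nodes are skill names taken from the list above, an edge `a -> b` means `b` may follow `a`, and every path from a start skill to a goal skill is one task. The specific edges below are invented for illustration and are not MetaFine's actual task graph.

```python
# Adjacency list: skill -> skills that may follow it (edges are hypothetical).
SKILL_GRAPH: dict[str, list[str]] = {
    "grasp-part": ["align", "rotate-along"],
    "align": ["insert", "slide"],
    "insert": ["rotate-along"],
    "rotate-along": [],
    "slide": [],
}


def enumerate_tasks(graph: dict[str, list[str]], start: str, goal: str) -> list[list[str]]:
    """Return every simple path from start to goal; each path is one composed task."""
    stack = [(start, [start])]
    paths = []
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # avoid revisiting a skill within one path
                stack.append((nxt, path + [nxt]))
    return sorted(paths)


tasks = enumerate_tasks(SKILL_GRAPH, "grasp-part", "rotate-along")
for task in tasks:
    print(" -> ".join(task))
```

Under this representation, absorbing an external benchmark task amounts to expressing it as a path over the same node vocabulary, adding new nodes or edges only where no existing skill fits.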

B · Extensible Asset Library

Scalable, part-aware assets.

MetaFine ships with a part-annotated object library and rapid annotation tooling. Assets plug directly into the task graph, and external benchmarks are absorbed under the same compositional formalism for unified diagnostic comparison.

MetaFine asset library at a glance: 431 objects · 1,078 annotated parts · 4,312 grasp poses
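
A part-annotated asset can be pictured as an object that owns named parts, each carrying its own grasp poses. The schema below is a hypothetical sketch (class names, fields, and the quaternion convention are assumptions, not MetaFine's file format); it only shows how object, part, and grasp-pose counts like those above would be tallied.

```python
from dataclasses import dataclass, field


@dataclass
class GraspPose:
    position: tuple[float, float, float]
    quaternion: tuple[float, float, float, float]  # (w, x, y, z), an assumed convention


@dataclass
class Part:
    name: str
    grasp_poses: list[GraspPose] = field(default_factory=list)


@dataclass
class Asset:
    name: str
    parts: list[Part] = field(default_factory=list)


def summarize(library: list[Asset]) -> dict[str, int]:
    """Count objects, annotated parts, and grasp poses in a library."""
    parts = [p for asset in library for p in asset.parts]
    return {
        "objects": len(library),
        "annotated_parts": len(parts),
        "grasp_poses": sum(len(p.grasp_poses) for p in parts),
    }


identity = (1.0, 0.0, 0.0, 0.0)
library = [
    Asset("bottle", [
        Part("cap", [GraspPose((0.0, 0.0, 0.20), identity)]),
        Part("body", [GraspPose((0.0, 0.0, 0.10), identity),
                      GraspPose((0.03, 0.0, 0.10), identity)]),
    ]),
    Asset("drawer", [Part("handle", [GraspPose((0.1, 0.0, 0.05), identity)])]),
]
summary = summarize(library)
print(summary)
```

For scale, the published counts work out to roughly 2.5 annotated parts per object and 4 grasp poses per part.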

04 · Key Findings

What MetaFine reveals.

We cannot build what we cannot measure.

Finding 01

Binary success inflates capability.

Policies that appear strong under binary benchmarks often fail once part-level semantic and physical constraints are enforced. Under disentangled evaluation, agents scoring 85% on conventional benchmarks can drop to 40%, revealing that binary success often conflates lucky completion with genuine fine-grained skill.
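
The gap between the two scores can be illustrated with a toy calculation. In this sketch a rollout counts toward binary success if the task end state is reached, but counts toward strict success only if the correct part was manipulated and the physical constraints were respected as well. The `Rollout` fields and the data are synthetic assumptions, not MetaFine's scoring code or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Hypothetical per-rollout flags (field names are assumptions)."""
    task_completed: bool   # end state reached, by any means
    correct_part: bool     # the instructed part was the one manipulated
    constraints_met: bool  # physical constraints respected during execution


def binary_success(rollouts: list[Rollout]) -> float:
    """Conventional score: only the end state matters."""
    return sum(r.task_completed for r in rollouts) / len(rollouts)


def strict_success(rollouts: list[Rollout]) -> float:
    """Score that also enforces part-level and physical constraints."""
    ok = sum(r.task_completed and r.correct_part and r.constraints_met for r in rollouts)
    return ok / len(rollouts)


# Synthetic rollouts for illustration only.
rollouts = [
    Rollout(True, True, True),
    Rollout(True, True, True),
    Rollout(True, False, True),   # finished, but by grasping the wrong part
    Rollout(True, True, False),   # finished, but violated a constraint
    Rollout(False, False, False),
]
binary = binary_success(rollouts)
strict = strict_success(rollouts)
print(binary, strict)
```

Here two of the four completed rollouts are lucky completions, which is exactly the kind of case a single binary rate cannot distinguish from genuine skill.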

Finding 02 · Understanding

Training paradigm matters more than scale.

Semantic grounding is governed more by how a model is trained than by how large it is. End-to-end policies often conflate language with visual priors, whereas modular architectures better preserve fine-grained instruction-following under attribute-level semantic changes.

Finding 03 · Perception

Visual resolution sets a hard precision ceiling.

Fine-grained manipulation depends on whether the encoder can resolve precise part-level spatial structure. When spatial fidelity is insufficient, downstream action generation cannot recover the missing information, making visual resolution an absolute ceiling on manipulation precision.

Finding 04 · Perception

Geometric and photometric robustness fail differently.

Robustness is not a single property. Geometric robustness is tied primarily to encoder spatial representations, while photometric robustness depends more strongly on modality-level invariance and pretraining diversity, exposing distinct failure sources under visual change.

Finding 05 · Behavior

Behavior trades off stability and expressiveness.

Deterministic action generation produces stable but rigid trajectories, while stochastic generation offers more diverse corrections but is more vulnerable to accumulated spatial drift. Fine-grained execution therefore depends not only on control smoothness, but also on how different action paradigms propagate perceptual uncertainty.

Finding 06

Hybrid real–sim evaluation stabilizes estimation.

Small-batch hardware evaluation is inherently noisy and can misrepresent true policy performance. By calibrating large-scale simulation with a limited set of paired real rollouts, hybrid real–sim evaluation produces estimates that are substantially more stable and more faithful to real-world capability.

05 · Hybrid Real–Sim

Scalable real-world validation.

MetaFine integrates 3D Gaussian Splatting-based real-to-sim transfer with prediction-powered inference. A small number of paired real-world rollouts calibrates large-scale simulation — enabling stable, reproducible estimates of true policy performance under limited hardware budgets.

01 · Real: physical capture of the real-world setup
↓ real2sim transition with a mobile device
02 · Sim: 3DGS reconstruction of the simulation scene
↓ policy rollout in simulation
03 · Rollout: policy execution

Figure: PPI calibration plot, comparison across conditions. Limited real trials calibrate large-scale sim estimates.
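
For readers unfamiliar with prediction-powered inference, the calibration step can be sketched with the standard PPI estimator for a mean: take the average outcome over many simulation-only rollouts, then correct it by the average real-minus-sim difference measured on the small paired set. Whether MetaFine uses exactly this form is an assumption; the function name, arguments, and data below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean(sim_unpaired, sim_paired, real_paired, alpha=0.05):
    """PPI estimate of real-world success rate with a normal-approximation interval.

    sim_unpaired: outcomes of many simulation-only rollouts
    sim_paired / real_paired: outcomes of the same scenarios run in sim and on hardware
    """
    if len(sim_paired) != len(real_paired):
        raise ValueError("paired lists must have equal length")
    if len(sim_unpaired) < 2 or len(real_paired) < 2:
        raise ValueError("need at least two samples in each set")
    # Rectifier: how far simulation is off on the scenarios where truth is known.
    rectifier = [y - f for y, f in zip(real_paired, sim_paired)]
    estimate = fmean(sim_unpaired) + fmean(rectifier)
    std_err = sqrt(
        variance(sim_unpaired) / len(sim_unpaired)
        + variance(rectifier) / len(rectifier)
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


# Synthetic outcomes for illustration only (1.0 = success).
sim_unpaired = [1.0] * 70 + [0.0] * 30
sim_paired = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
real_paired = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]

estimate, lower, upper = ppi_mean(sim_unpaired, sim_paired, real_paired)
print(f"{estimate:.3f} [{lower:.3f}, {upper:.3f}]")
```

In this toy case simulation alone reports 0.70, the paired rollouts show simulation is optimistic by 0.20, and the corrected estimate is 0.50. The interval narrows when simulation tracks reality closely on the paired set, which is why a few real rollouts can go a long way.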

06 · Cite

BibTeX

@article{xu2026metafine,
  title   = {Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation},
  author  = {Xu, He-Yang and Zhang, Pengyuan and Ge, Zongyuan and Hao, Xiaoshuai and Belongie, Serge and Geng, Xin and Peng, Yuxin and Wei, Xiu-Shen},
  journal = {arXiv preprint arXiv:2605.19986},
  year    = {2026}
}
