Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols

Xianchao Zeng1,2⚖️, Xinyu Zhou2,3⚖️, Youcheng Li1,2, Jiayou Shi4, Tianle Li4,
Liangming Chen3✉️, Lei Ren1✉️, Yong-Lu Li2,4✉️
1Beihang University, 2Shanghai Innovation Institute,
3Southern University of Science and Technology, 4Shanghai Jiao Tong University
⚖️ Equal contribution. ✉️ Corresponding authors.
CVPR 2026
Pipeline

Our pipeline leverages real-world failure data to build a dataset and train ViFailback-8B for failure diagnosis and correction.

Abstract

Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in diagnosing failures and learning from them. Moreover, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these limitations, we introduce ViFailback, a framework that diagnoses robotic manipulation failures and provides both textual and visual correction guidance.

Our framework utilizes explicit visual symbols to improve annotation efficiency. We also release the ViFailback dataset, a large-scale collection of 58,128 Visual Question Answering (VQA) pairs together with their corresponding 5,202 real-world manipulation trajectories.

Overview

Overview of ViFailback Framework.
Left: Real-world trajectory collection. Middle: Dataset comprising 58,128 VQA pairs. Right: Fine-tuning Qwen3-VL-8B as an external supervisor.


ViFailback Benchmark


Experiments

Benchmark Results


Real-world Experimental Demos

VSF Method: Visual Symbols-Following

PMC Method: Point-based Motion Control

BibTeX

@article{zeng2025diagnose,
  title={Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols},
  author={Zeng, Xianchao and Zhou, Xinyu and Li, Youcheng and Shi, Jiayou and Li, Tianle and Chen, Liangming and Ren, Lei and Li, Yong-Lu},
  journal={arXiv preprint arXiv:2512.02787},
  year={2025}
}