Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in diagnosing failures and learning from them. Moreover, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. To address these limitations, we introduce ViFailback, a framework that diagnoses robotic manipulation failures and provides both textual and visual correction guidance.
Our framework employs explicit visual symbols to improve annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,128 Visual Question Answering (VQA) pairs together with the corresponding 5,202 real-world manipulation trajectories.
Overview of ViFailback Framework.
Left: Real-world trajectory collection. Middle: Dataset of 58,128 VQA pairs. Right: Fine-tuning Qwen3-VL-8B as an external supervisor.
VSF Method: Visual Symbols-Following
PMC Method: Point-based Motion Control
@article{zeng2025diagnose,
  title={Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols},
  author={Zeng, Xianchao and Zhou, Xinyu and Li, Youcheng and Shi, Jiayou and Li, Tianle and Chen, Liangming and Ren, Lei and Li, Yong-Lu},
  journal={arXiv preprint arXiv:2512.02787},
  year={2025}
}