The Unreasonable Effectiveness of VLMs for Zero-shot Procedural Mistake Detection

Procedural mistake detection is important for quality control and user assistance across many disciplines. Recent work in this field has achieved significant gains by using the reasoning capabilities of Video-Language Models (VLMs) as components within multi-stage pipelines, which consist of separate modules for supervised temporal action segmentation, error detection, and explainability. Consequently, these pipelines remain dependent on tailored trainin...
