SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
arxiv.org·10h
Flag this post

View PDF HTML (experimental)

Abstract:Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video–Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool–tissue interactions, which a fine-tuned large lang…

Similar Posts

Loading similar posts...