Graph it first! Enabling Reasoning on Long-form Egocentric Videos through Scene Graphs
Existing multi-modal large language models (MLLMs) face significant challenges in processing long video sequences due to strict input token limitations. As a result, current video understanding approaches, especially in egocentric settings characterized by complex dynamics, frequent state changes, and moving cameras, are forced to massively subsample frames. This leads to severe loss of temporal and contextual information, constraining their abi...