Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval
arxiv.org·3d

Abstract: Training vision language models (VLMs) aims to align visual representations from a vision encoder with the textual representations of a pretrained large language model (LLM). However, many VLMs exhibit reduced factual recall performance compared to their LLM backbones, raising the question of how effective multimodal fine-tuning is at extending existing mechanisms within the LLM to visual inputs. We argue that factual recall based on visual inputs requires VLMs to solve a two-hop problem: (1) forming entity representations from visual inputs, and (2) recalling associated factual knowledge based on these entity representations. By benchmarking 14 VLMs with various…
