Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods must be trained on (partially) annotated data from the target dataset before they can be applied, which prevents them from being used directly and raises concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot...
