ApET: Approximation-Error Guided Token Compression for Efficient VLMs

arXiv:2602.19870v1 Announce Type: new Abstract: Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are...
