Efficient Remote Sensing Instance Segmentation with Linear-Time State Space Distilled Visual Foundation Models

The computational complexity of Transformers scales quadratically with the number of tokens, which significantly constrains the efficiency of vision models, particularly recent ViT-based foundation models in dense prediction tasks. Instance segmentation, a typical dense visual prediction task in the remote sensing field, faces the same challenge. In this paper, inspired by recent advances in knowledge distillation for large language models, w...
