MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation
Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-dema...