Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
arxiv.org·1d

Abstract: The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting in the Vision-Language Model (VLM) backbone during fine-tuning. Co-training with external reasoning data helps, but it requires expert tuning and incurs data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visu…
