DriveStack-VLA: Render-Teacher Alignment for BEV-Based DeepStack Vision-Language-Action Model

Vision-Language-Action (VLA) driving models convert a pretrained Vision-Language Model (VLM) into a driving policy, allowing the policy to use world knowledge and follow language guidance. However, existing VLA driving models still lack driving-oriented spatial intelligence: their policies are mainly grounded in perspective image tokens and language priors, while precise motion planning requires metric geometry, top-down scene structure, and attention to safety-...
