Cross-embodiment traversability transfer and feasible trajectory generation. The legged robot (source) explores the environment and builds a traversability prior (blue path), which the aerial robot (target) uses for safe navigation through the shared workspace (green volume). Credit: Iana Zhura et al.
While the capabilities of robots have improved significantly over the past decades, they are not always able to reliably and safely move in unknown, dynamic and complex environments. To move in their surroundings, robots rely on algorithms that process data collected by sensors or cameras and plan future actions accordingly.
Researchers at Skolkovo Institute of Science and Technology (Skoltech) have developed SwarmDiffusion, a new lightweight generative AI model that can predict where a robot should go and how it should move from a single image. SwarmDiffusion, introduced in a paper pre-published on the arXiv preprint server, relies on a diffusion model, a technique that gradually adds noise to input data and then removes it to produce the desired output.
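For readers curious about the underlying technique, the short Python sketch below illustrates the core idea of a diffusion model: a forward process that corrupts data with noise, and a learned reverse process that removes it step by step. The noise schedule, step count and model interface here are illustrative assumptions, not the paper's implementation.

```python
import torch

# Illustrative linear noise schedule (an assumption, not the paper's actual schedule).
T = 50                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)    # per-step noise variance
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward process: corrupt a clean sample x0 to noise level t."""
    noise = torch.randn_like(x0)
    a = alphas_bar[t].sqrt()
    b = (1.0 - alphas_bar[t]).sqrt()
    return a * x0 + b * noise, noise

@torch.no_grad()
def denoise(model, x_t, cond):
    """Reverse process: iteratively remove noise, guided by conditioning
    (e.g., image features), to recover a clean output.
    `model(x_t, t, cond)` is a hypothetical trained noise predictor."""
    for t in reversed(range(T)):
        eps_hat = model(x_t, t, cond)                      # predicted noise
        a_bar = alphas_bar[t]
        x0_hat = (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
        x_t = x0_hat if t == 0 else add_noise(x0_hat, t - 1)[0]
    return x_t
```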
"Navigation is more than ’seeing," a robot also needs to decide how to move, and this is where current systems still feel outdated," Dzmitry Tsetserukou, senior author of the paper, told Tech Xplore.
"Traditionally, robots build a detailed map, mark which areas appear safe, and then run a heavy algorithm to find a route. It works, but it’s slow and doesn’t take full advantage of today’s progress in AI. This is what inspired our work."
The key objective of the recent work by Tsetserukou and his colleagues was to develop a "thinking" path planning method that is learnable and generalizable. SwarmDiffusion can quickly determine where a robot should go by processing a single image, removing the need for additional mapping steps or sophisticated path planning techniques.
"Another challenge we addressed is that every robot platform moves differently," explained Tsetserukou.
"A drone, a quadruped, and a wheeled robot each have their own motion style. Most current approaches require collecting special data for each of them, which is time-consuming and simply not scalable. Our goal was to design a system that needs only a few robot-specific trajectories for pretraining yet can still figure out how to move safely from a single image."
SwarmDiffusion: A path planner that reasons
The new model introduced by the researchers is partly inspired by how humans navigate their surroundings relying on their common sense. The team wanted to allow robots to rapidly and ‘intuitively’ identify dangerous areas, obstacles and other challenges in their surroundings, planning their actions accordingly.
"Path planning is one of the key tasks for autonomous robots," said Tsetserukou. "The robot needs not only to design the path from point A to point B, but also to avoid obstacles. Yet, most path planning algorithms are built on classical methods, such as A*, RRT, APF, CHOMP, etc., they do not have any generalization to work in an unseen environment and require a heavy map, that in some cases weighs terabytes of memory."
From a single 2D image, SwarmDiffusion can identify areas in which a robot can safely move and those that are risky. The model relies on predictions made by a vision-language model (VLM) to spot open floors (i.e., large spaces with fewer walls), obstacles, narrow gaps and other possible hazards that can prevent safe navigation.
The proposed model consists of two interconnected components: (1) the Traversability Student model, where a frozen visual encoder and state encoder jointly modulate features via FiLM to produce a traversability prediction distilled from a vision–language model (VLM); and (2) diffusion-based trajectory generation, where a UNet progressively denoises a random trajectory x_t conditioned on the modulated visual features and start–goal vector, yielding feasible and safe paths x_0. The process repeats for N denoising steps. Credit: Iana Zhura et al.
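The FiLM conditioning mentioned in the caption can be pictured with a minimal, generic PyTorch sketch: a conditioning vector (here, a robot state embedding) produces per-channel scale and shift values that modulate the visual features. The layer sizes and tensor shapes are chosen for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector produces
    per-channel scale (gamma) and shift (beta) applied to visual features."""
    def __init__(self, cond_dim, feat_channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_channels)

    def forward(self, visual_feats, cond):
        # visual_feats: (B, C, H, W) from a frozen visual encoder
        # cond:         (B, cond_dim) from a state encoder
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return gamma * visual_feats + beta

# Example with assumed sizes: modulate 256-channel feature maps
# with a 32-dimensional state embedding.
film = FiLM(cond_dim=32, feat_channels=256)
feats = torch.randn(1, 256, 16, 16)
state = torch.randn(1, 32)
modulated = film(feats, state)   # same shape as feats: (1, 256, 16, 16)
```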
Rather than creating a detailed map of the surrounding environment, SwarmDiffusion simply delineates the path that the robot should follow to avoid collisions and safely proceed towards a target location. To create this path, the model starts from a rough guess and then gradually but quickly refines it with denoising.
"High level reasoning is done by a VLM model that specifies the position of obstacles, traversable areas, and behavior with no prompt from the user," explained Tsetserukou "On the other hand, diffusion with the denoising process generates a smooth path to the goal."
"A further advantage of SwarmDiffusion is that it can be applied to a broad range of robots, including drones, legged robots and wheeled robots. This is because it can learn general movement principles that are not specific to a single type of robot, requiring only minimal information (e.g., a robot’s preferred direction or turning behavior)," said Ph.D. student Iana Zhura, the first author of the paper.
SwarmDiffusion is also lightweight and can run directly on a robot's onboard processors. The researchers evaluated their model in a series of experiments, applying it to two different robots: a drone and a dog-like four-legged robot. They found that it reliably planned the robots' future actions within a short amount of time (around 90 milliseconds).
UAV simulation architecture. Credit: Iana Zhura et al.
"Our experiments proved that it works reliably in environments that were never seen before by the robot," said Tsetserukou.
"The human brain enables navigation in new environments from the first sight, leveraging stereo vision data of the eyes. On the contrary, SwarmDiffusion relies only on 2D images that outperform human perception. We do not need expensive 3D sensors such as LiDARs, RADAR, or depth cameras for path planning anymore."
Real-world applications and future improvements
The new model developed by Tsetserukou and his colleagues could greatly simplify robot navigation, opening new possibilities for various robotics applications. For instance, it could improve the navigation of robot teams in warehouses, agricultural settings or industrial sites, as well as robots designed to complete search and rescue missions, inspect infrastructure, deliver parcels and monitor natural environments.
"We proved that robots do not need elaborate maps or long chains of processing to move with confidence," said Tsetserukou.
"SwarmDiffusion can take a single snapshot and turn it into motion, cutting away much of the overhead that usually slows robots down, because it requires only a few path examples for each platform. It also points toward an exciting possibility: a single, versatile model that could guide a wide variety of robots without constant rebuilding or customization."
Image summarizing how the team envisions the deployment of SwarmDiffusion on different types of robots. Credit: Iana Zhura et al.
The researchers are now planning new studies aimed at further improving SwarmDiffusion and evaluating its potential in a wider range of real-world experiments. They also hope to soon deploy it on multiple robots at once, to determine whether it improves their ability to tackle a common mission as teams.
"SwarmDiffusion opens the door for groups of robots to work together as a unified, intelligent team which leverages Swarm Diffusion Intelligence to exchange the knowledge about obstacles and generated trajectories," said Tsetserukou.
"Looking ahead, we plan to extend the approach to more tasks, not just navigation. The same idea that generates a path could also help with choosing better viewpoints, exploring a scene, or even supporting manipulation tasks. The long-term goal is to build a single model that can understand what it sees, understand what it needs to do, and produce the right action without relying on many separate modules."
In their upcoming research, Tsetserukou and his colleagues plan to focus more on coordination between many robots in a team. Specifically, they will explore the possibility of using SwarmDiffusion to coordinate the actions of many different types of robots at once, allowing them to complete missions faster and more efficiently.
"In the future we will build a Multi-Agent Word Foundation Model for navigation of swarms of heterogeneous robots so that humanoid, mobile, aerial, quadruped robots create independent paths and not intersect with each other and humans in unseen environments," added Tsetserukou.
"The task allocation for multi-agent systems will also be guided by diffusion. That development will embody our concept in Future 6.0 where robots with Physical AI will be working in robotic cities and be seamlessly integrated in the human world."
Written by Ingrid Fadelli, edited by Sadie Harley, and fact-checked and reviewed by Robert Egan.
More information: Iana Zhura et al, SwarmDiffusion: End-To-End Traversability-Guided Diffusion for Embodiment-Agnostic Navigation of Heterogeneous Robots, arXiv (2025). DOI: 10.48550/arxiv.2512.02851
Journal information: arXiv
© 2026 Science X Network
Citation: One image is all robots need to find their way (2026, January 8) retrieved 8 January 2026 from https://techxplore.com/news/2026-01-image-robots.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.