Artificial Intelligence
arXiv
Wenqian Zhang, Weiyang Liu, Zhen Liu
16 Oct 2025 • 3 min read

Quick Insight
How AI is Learning to Build Its Own Robots
What if a computer could design its own robot? Researchers have tested whether today’s large language models, those chatty AI systems, can be taught to act like tiny engineers. Using a video-game-style playground called BesiegeField, the AI picks standardized virtual parts, a bit like LEGO bricks, snaps them together, and watches the creation crawl, roll, or lift objects in a simulated world. The test shows that the AI needs a mix of spatial sense, clever planning, and the ability to follow step-by-step instructions, just like a human builder. Early experiments reveal that current open-source models still stumble, but with a dash of reinforcement learning they start to improve, learning from trial and error much like a child learning to ride a bike. Imagine future machines that can design better tools for us, or even craft custom gadgets on demand. This work hints at a future where creativity isn’t only human, opening doors to smarter, self-assembling technology that could reshape everyday life. 🌟
Article Short Review
Exploring AI’s Creative Potential in Machine Design
This insightful research investigates whether large language models (LLMs) can learn to create complex machines, a task traditionally indicative of human intelligence and engineering prowess. The study frames this inquiry through the lens of compositional machine design, where functional machines are assembled from standardized components within a simulated physical environment. To facilitate this, the authors introduce BesiegeField, a novel testbed built upon the popular machine-building game Besiege, enabling part-based construction, realistic physical simulation, and reward-driven evaluation. Benchmarking state-of-the-art LLMs with agentic workflows revealed significant shortcomings, particularly in spatial reasoning and strategic assembly. Consequently, the research explores reinforcement learning (RL) as a promising avenue for improvement, curating a cold-start dataset and conducting finetuning experiments to highlight persistent challenges at the intersection of language, machine design, and physical reasoning.
Critical Evaluation
Strengths
The introduction of BesiegeField is a significant strength, offering a unique, interactive environment that balances realistic physics with part semantics and compositional rules. This platform provides a robust framework for benchmarking LLMs and exploring advanced techniques like Reinforcement Learning with Verifiable Rewards (RLVR). The methodology is comprehensive, employing both single LLM agents with Chain-of-Thought (CoT) reasoning and iterative multi-agent systems, which include meta-designer and builder agents, to tackle complex design challenges.
The study effectively identifies critical capabilities required for success, such as spatial reasoning, strategic assembly, and instruction-following, pinpointing where current LLMs fall short. The exploration of RL finetuning, utilizing methods like Group Relative Policy Optimization (GRPO) and LoRA parametrization, demonstrates a forward-thinking approach to enhancing AI’s design capabilities. The curation of a cold-start dataset further supports these experimental investigations, providing a solid foundation for future research.
Weaknesses
The results expose a clear limitation of current open-source LLMs, which significantly underperform on compositional machine design tasks. RL finetuning improves design validity and performance, but the findings indicate that models mainly make detail-level adjustments rather than achieving fundamental compositional breakthroughs. This suggests that core challenges in spatial precision and 3D understanding persist and will require more than finetuning to overcome. The reliance on a cold-start dataset for RL also implies a need for pre-existing design knowledge, potentially limiting the models’ capacity for truly novel, unconstrained creation.
Implications
This research has profound implications for the future of AI in engineering design and creative problem-solving. It underscores the necessity for developing LLMs with enhanced physical reasoning and 3D understanding, moving beyond purely linguistic capabilities. BesiegeField emerges as an invaluable testbed for advancing these frontiers, providing a standardized environment for evaluating and improving AI agents. The findings suggest that a hybrid approach, integrating the generative power of LLMs with the iterative optimization of RL, could pave the way for more capable AI designers. This work highlights critical areas for future research, pushing the boundaries of what AI can achieve in complex, physically constrained environments.
Conclusion
Overall, this article makes a valuable contribution to understanding the capabilities and limitations of LLMs in compositional machine design. By introducing BesiegeField and systematically benchmarking LLMs, the authors clearly delineate the challenges in spatial reasoning and strategic assembly. While current LLMs fall short, the exploration of reinforcement learning offers a promising path for incremental improvements, particularly in refining existing designs. This research not only provides a robust framework for future studies but also illuminates the significant open challenges that must be addressed for AI to truly master the art of creative engineering.
Article Comprehensive Review
Exploring the Creative Frontier: Large Language Models in Compositional Machine Design
The intricate process of designing complex machines has long been a hallmark of human ingenuity and a cornerstone of engineering. This groundbreaking research delves into whether Large Language Models (LLMs) can transcend their traditional text-based roles to engage in creative machine design. The study frames this inquiry through the lens of compositional machine design, a task where functional machines, such as those capable of locomotion or manipulation, are assembled from standardized components within a simulated physical environment. To facilitate this ambitious investigation, the researchers introduce BesiegeField, an innovative testbed built upon the popular machine-building game Besiege. This platform uniquely enables part-based construction, realistic physical simulation, and reward-driven evaluation, providing a robust environment for experimentation. Through benchmarking state-of-the-art LLMs utilizing sophisticated agentic workflows, the study meticulously identifies critical capabilities essential for success, including advanced spatial reasoning, strategic assembly, and precise instruction-following. Recognizing the current limitations of open-source models in these areas, the research pivots to explore Reinforcement Learning (RL) as a promising avenue for improvement, curating a cold-start dataset and conducting targeted RL finetuning experiments. Ultimately, this work illuminates significant open challenges at the complex intersection of language, machine design, and physical reasoning, charting a course for future advancements in AI’s creative potential.
Critical Evaluation: Unpacking LLMs in Engineering Design
Strengths of the BesiegeField Framework and Methodology
One of the most significant strengths of this research lies in the introduction of BesiegeField, a novel and highly effective testbed for studying compositional machine design. This environment strikes an impressive balance between realistic physics, detailed part semantics, and clear compositional rules, making it an ideal platform for rigorous scientific inquiry. BesiegeField’s interactive nature allows Large Language Model (LLM) agents to not only construct machines but also to simulate their performance and evaluate their efficacy, providing a comprehensive feedback loop crucial for learning and refinement. Its design supports both the benchmarking of existing LLMs and the exploration of advanced techniques like Reinforcement Learning (RL) finetuning, offering a versatile tool for generating diverse design solutions and pushing the boundaries of AI capabilities.
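The construct, simulate, and evaluate loop described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the BesiegeField API: `Part`, `toy_simulate`, `toy_propose`, and `design_loop` are names invented for illustration, the "simulator" is a toy reward that simply counts wheels, and the "agent" is a fixed rule where a real workflow would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Part:
    kind: str    # hypothetical part name, e.g. "block" or "wheel"
    parent: int  # index of the part this one attaches to; -1 for the root


Design = Tuple[Part, ...]


def toy_simulate(design: Design) -> float:
    """Stand-in for a physics rollout: reward is the wheel count, capped at 4."""
    return float(min(sum(p.kind == "wheel" for p in design), 4))


def toy_propose(design: Design, score: float) -> Design:
    """Stand-in for an LLM agent: attach one more wheel to the root part."""
    return design + (Part("wheel", 0),)


def design_loop(
    initial: Design,
    propose: Callable[[Design, float], Design],
    simulate: Callable[[Design], float],
    rounds: int,
) -> Tuple[Design, float]:
    """Propose a revision, simulate it, and keep the best-scoring design."""
    best, best_score = initial, simulate(initial)
    current, score = best, best_score
    for _ in range(rounds):
        current = propose(current, score)  # the agent sees the last feedback
        score = simulate(current)          # reward-driven evaluation
        if score > best_score:
            best, best_score = current, score
    return best, best_score
```

Calling `design_loop((Part("block", -1),), toy_propose, toy_simulate, rounds=6)` keeps the first design that reaches the capped reward, which shows why the loop tracks the best design separately from the most recent one.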
The methodology employed for evaluating LLM performance is another notable strength. The researchers developed a parsimonious machine representation and established robust quantitative evaluation metrics, including validity rates and simulation scores. This systematic approach allows for objective assessment of design quality and functional success. Furthermore, the detailed defect analysis and proposed modification plans, integrated within the agentic workflows, demonstrate a sophisticated understanding of iterative design processes. The exploration of both single LLM agents leveraging Chain-of-Thought (CoT) reasoning and iterative multi-agent systems for machine design and refinement showcases a comprehensive approach to understanding how LLMs can tackle complex engineering tasks.
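As a rough illustration of the two kinds of metrics mentioned above, the sketch below aggregates a batch of generated designs into a validity rate and summary simulation scores. The review does not give the paper's exact definitions, so the choices here are assumptions: an invalid design is marked with `None`, the validity rate is the fraction of designs that are not `None`, and scores are averaged over valid designs only. The function name `summarize` is hypothetical.

```python
from typing import Dict, List, Optional


def summarize(scores: List[Optional[float]]) -> Dict[str, float]:
    """Aggregate per-design results.

    Each entry is a simulation score, or None if the design was invalid
    (assumed convention, e.g. it failed to parse or to assemble).
    """
    if not scores:
        raise ValueError("need at least one design result")
    valid = [s for s in scores if s is not None]
    return {
        # Fraction of generated designs that passed validity checks.
        "validity_rate": len(valid) / len(scores),
        # Mean and best score over valid designs only; 0.0 if none are valid.
        "mean_score": sum(valid) / len(valid) if valid else 0.0,
        "max_score": max(valid) if valid else 0.0,
    }
```

Reporting the validity rate separately from the score keeps a model that produces a few excellent machines among many broken ones distinguishable from one that produces consistently mediocre but valid machines.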
The integration of Reinforcement Learning (RL) as a path to improvement is a particularly strong aspect of this study. By curating a cold-start dataset and conducting targeted RL finetuning experiments using methods like Group Relative Policy Optimization (GRPO) and LoRA parametrization, the research provides concrete evidence of how LLMs can enhance their design capabilities. The findings clearly indicate that RL significantly improves both design validity and overall performance, especially when initiated with a cold-start dataset. This exploration into Reinforcement Learning with Verifiable Rewards (RLVR) for compositional machine design represents a forward-thinking approach to overcoming current LLM limitations, offering a clear pathway for future research and development in AI-driven engineering.
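The core idea behind GRPO, scoring each sampled output relative to the other samples drawn for the same prompt, can be sketched in a few lines. This is a simplified illustration of the group-relative advantage only, not the authors' training code: it omits the policy-gradient update, clipping, and any KL term. The use of the population standard deviation and the `eps` constant are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With verifiable rewards such as a simulation score, a group in which every sample earns the same reward yields all-zero advantages and therefore no learning signal. That is one plausible reason a cold-start dataset helps: it makes it more likely that at least some samples in each group are valid machines with distinct scores.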
Finally, the ambition of the research to address a fundamental question—whether LLMs can truly learn to create—is a significant strength. By tackling a complex engineering domain like compositional machine design, the study pushes the boundaries of what is expected from LLMs, moving beyond mere text generation to engage with physical reasoning and spatial intelligence. This foundational work provides a robust framework for future investigations into AI’s creative potential in practical, real-world applications, setting a high standard for interdisciplinary research at the confluence of AI, engineering, and cognitive science.
Challenges and Limitations in LLM Machine Design
Despite the innovative framework and promising results, the research also highlights several critical limitations in applying current LLMs to compositional machine design. The primary one is the significant shortfall of state-of-the-art open-source LLMs in key capabilities such as spatial reasoning, strategic assembly, and precise instruction-following. Designing machines in a 3D physical environment demands a sophisticated understanding of spatial relationships, object interactions, and structural integrity, areas where current LLMs, primarily trained on text, struggle considerably. The empirical findings underscore these difficulties, particularly in 3D understanding, which remains a substantial hurdle for LLMs attempting to construct functional physical objects.
While Reinforcement Learning (RL) finetuning demonstrated improvements in design validity and performance, the study notes that these enhancements primarily manifest as detail-level adjustments rather than fundamental conceptual design breakthroughs. This suggests that while RL can refine existing designs or correct minor flaws, it may not yet enable LLMs to generate truly novel or radically different machine architectures from scratch. The inherent complexity of compositional design, which requires not only spatial precision but also a deep understanding of physical interactions and emergent properties, poses a significant barrier to achieving higher-level creative design capabilities with current models.
The reliance on a curated cold-start dataset for RL finetuning, while beneficial for the experiments, also points to a potential limitation. The process of curating such a dataset can be resource-intensive and may not always be scalable or readily available for every new design challenge. This dependency implies that significant human effort or pre-existing data is still required to guide the LLMs, rather than the models autonomously discovering optimal design principles. Furthermore, the discussion on challenges in agentic machine design emphasizes the ongoing need for LLMs to acquire domain-specific knowledge and explore novel solutions, indicating that a fully autonomous and highly creative AI designer is still a distant goal.
Another potential caveat lies in the generalizability of findings from the BesiegeField environment. While BesiegeField offers a realistic physics simulation, it is still a simplified, game-based environment. The transition from designing machines in a simulated game to real-world engineering challenges, which involve material properties, manufacturing constraints, and unpredictable environmental factors, could introduce new complexities that current LLM capabilities and the BesiegeField framework may not fully address. The gap between simulated success and practical application remains an important consideration for the broader impact of this research.
Implications for AI and Engineering
The implications of this research for the future of AI in engineering design are profound and far-reaching. By demonstrating that LLMs, even with current limitations, can engage in compositional machine design, the study opens new avenues for design automation and intelligent assistance in complex engineering tasks. This work suggests a future where AI could play a significant role in generating initial design concepts, optimizing existing structures, or even identifying novel solutions that human engineers might overlook. The development of BesiegeField as a testbed provides a crucial tool for researchers to continue pushing these boundaries, fostering innovation in areas like robotics, architectural design, and advanced manufacturing.
This research also has significant implications for advancing the core capabilities of Large Language Models themselves. The identified shortcomings in spatial reasoning, 3D understanding, and physical reasoning highlight critical areas where LLM development needs to evolve. Future LLMs will likely incorporate more sophisticated multimodal architectures, integrating visual and spatial data more effectively to overcome these limitations. The exploration of Reinforcement Learning (RL) as a finetuning mechanism points towards a paradigm shift where LLMs are not just passive knowledge repositories but active agents capable of learning through interaction and feedback in complex environments. This could lead to LLMs that are more robust, adaptable, and capable of performing tasks requiring a deeper understanding of the physical world.
The study underscores the emergence of new interdisciplinary research avenues at the intersection of language, machine design, and physical reasoning. This convergence demands collaboration between AI researchers, mechanical engineers, cognitive scientists, and game developers to build more intelligent and creative AI systems. The challenges highlighted, such as achieving high spatial precision and strategic assembly, will drive innovation in areas like neuro-symbolic AI, embodied AI, and advanced simulation techniques. This research serves as a catalyst for developing AI that can not only understand and generate language but also interact with and shape the physical world in meaningful ways.
Ultimately, this work paves the way for enhanced human-AI collaboration in engineering. While LLMs may not yet fully replace human designers, they can become invaluable partners, assisting with iterative design processes, exploring vast design spaces, and providing creative prompts. The ability of LLMs to make detail-level adjustments through RL finetuning suggests their immediate utility in optimizing existing designs or performing rapid prototyping. As LLMs continue to improve in spatial and physical reasoning, their role could expand to more conceptual design phases, leading to a synergistic relationship where human intuition and AI’s computational power combine to achieve unprecedented levels of innovation in machine design.
Conclusion: Charting the Course for Creative AI in Engineering
This comprehensive analysis of the research into Large Language Models (LLMs) for compositional machine design reveals a pivotal step forward in understanding AI’s creative potential. The introduction of BesiegeField stands as a significant methodological contribution, providing a robust and interactive environment for benchmarking LLMs and exploring advanced learning paradigms. The study effectively demonstrates that while current open-source LLMs exhibit notable shortcomings in critical areas such as spatial reasoning and 3D understanding, the strategic application of Reinforcement Learning (RL) finetuning offers a promising pathway to enhance their design capabilities, particularly in refining and optimizing existing structures.
The research meticulously identifies the core challenges that LLMs face when transitioning from linguistic tasks to complex physical design, emphasizing the need for models to acquire deeper physical reasoning and strategic assembly skills. Despite these limitations, the findings underscore the immense potential for LLMs to contribute to engineering design, paving the way for future advancements in AI-driven automation and intelligent design assistance. This work is not merely an assessment of current AI capabilities but a foundational piece that charts a clear course for future research at the exciting intersection of language, machine design, and physical intelligence.
In conclusion, this article makes a substantial contribution to the field by rigorously testing the boundaries of LLM creativity in a tangible, engineering context. It provides invaluable insights into both the current state and the future trajectory of AI in design, highlighting the critical areas for development while simultaneously showcasing the transformative power of integrating advanced learning techniques like RL. The journey towards truly creative and autonomous AI designers is ongoing, but this research offers a compelling vision and a robust framework for navigating the complex challenges ahead, ultimately fostering a future where AI and human ingenuity collaborate to build the machines of tomorrow.