Learning General Policies with Policy Gradient Methods

Abstract: While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning, where provably correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular policy optimization methods, can be used to learn policies that generalize as combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches and extend them. From the former, we model policies as state transition classifiers, since (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states and, in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches, while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms; they result from the well-understood expressive limitations of GNNs and from the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods, by adding derived predicates and an alternative cost structure to optimize.
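The idea of a policy as a state transition classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the toy domain, the function names (`successors`, `score`, `transition_policy`, `greedy_rollout`), and the hand-written scoring rule are all assumptions made for illustration, with the scoring rule standing in for the learned GNN. The point it shows is that the policy assigns probabilities to successor states rather than to ground actions, so the same policy applies unchanged to instances of any size.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transition_policy(state, successors, score):
    """Return the successors of `state` and a probability for each one.

    The policy classifies transitions (state, next_state); it never
    refers to instance-specific ground actions.
    """
    succ = successors(state)
    probs = softmax([score(state, nxt) for nxt in succ])
    return succ, probs


# Hypothetical toy domain: a state is the distance n to the goal on a line.
def successors(n):
    return [n - 1, n + 1] if n > 0 else []  # the goal (n == 0) is terminal


# Hypothetical stand-in for a learned scorer (a GNN in the paper's setting).
def score(state, nxt):
    return 2.0 * (state - nxt)  # prefers transitions that move toward the goal


def greedy_rollout(start, max_steps=100):
    """Follow the most probable transition until the goal or the step limit."""
    state, steps = start, 0
    while successors(state) and steps < max_steps:
        succ, probs = transition_policy(state, successors, score)
        state = succ[probs.index(max(probs))]
        steps += 1
    return state, steps
```

Under these assumptions, calling `greedy_rollout(5)` and `greedy_rollout(50)` uses the same policy on two instance sizes; in an actor-critic setup the scoring rule would instead be a trained network, and the transition probabilities would be what the policy gradient updates.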


Comments:	In Proceedings of the 20th International Conference on Principles of Knowledge Representation and Reasoning (KR 2023)
Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2512.19366 [cs.AI]
	(or arXiv:2512.19366v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2512.19366 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Simon Ståhlberg
[v1] Mon, 22 Dec 2025 13:08:58 UTC (103 KB)
