Yahtzee: Reinforcement Learning Techniques for Stochastic Combinatorial Games

Computer Science > Machine Learning

arXiv:2601.00007 (cs)

Abstract: Yahtzee is a classic dice game with a stochastic, combinatorial structure and delayed rewards, making it an interesting mid-scale RL benchmark. While an optimal policy for solitaire Yahtzee can be computed using dynamic programming (DP), the multiplayer game is intractable, motivating approximation methods. We formulate Yahtzee as a Markov Decision Process (MDP) and train self-play agents using three policy gradient methods: REINFORCE, Advantage Actor-Critic (A2C), and Proximal Policy Optimization (PPO), all using a multi-headed network with a shared trunk. We ablate feature and action encodings, architecture, return estimators, and entropy regularization to understand their impact on learning. Under a fixed training budget, REINFORCE and PPO prove sensitive to hyperparameters and fail to reach near-optimal performance, whereas A2C trains robustly across a range of settings. Our agent attains a median score of 241.78 points over 100,000 evaluation games, within 5.0% of the optimal DP score of 254.59, achieving the upper section bonus and Yahtzee at rates of 24.9% and 34.1%, respectively. All models struggle to learn the upper bonus strategy, over-indexing on four-of-a-kind, highlighting persistent long-horizon credit-assignment and exploration challenges.
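To make the upper-bonus versus four-of-a-kind tension concrete, the following minimal Python sketch scores one roll both ways and checks the reported gap to the DP score. It assumes the standard Yahtzee rules (a 35-point bonus for an upper-section total of at least 63, and four-of-a-kind scored as the sum of all dice). The function names are hypothetical and are not taken from the paper's code.

```python
from collections import Counter

# Standard Yahtzee rules (an assumption; the abstract does not restate them).
UPPER_BONUS_THRESHOLD = 63
UPPER_BONUS = 35


def score_upper(dice, face):
    """Upper-section score: the sum of the dice showing `face`."""
    return face * sum(1 for d in dice if d == face)


def score_four_of_a_kind(dice):
    """Sum of all dice if at least four show the same face, else 0."""
    return sum(dice) if max(Counter(dice).values()) >= 4 else 0


def upper_bonus(upper_scores):
    """35 points if the six upper-section scores total at least 63."""
    return UPPER_BONUS if sum(upper_scores) >= UPPER_BONUS_THRESHOLD else 0


def relative_gap(agent_score, optimal_score):
    """Fractional shortfall of the agent relative to the optimal score."""
    return (optimal_score - agent_score) / optimal_score


dice = [5, 5, 5, 5, 2]
print(score_upper(dice, 5))          # 20 now, and progress toward the bonus
print(score_four_of_a_kind(dice))    # 22 now, but no progress toward the bonus
print(round(100 * relative_gap(241.78, 254.59), 2))  # 5.03
```

Four-of-a-kind pays two more points immediately, while the fives box contributes toward a bonus that is only awarded once the upper section is complete. That delayed payoff is one plausible reading of the credit-assignment difficulty the abstract describes.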


Comments:	20 pages, 19 figures
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
ACM classes:	I.2.1
Cite as:	arXiv:2601.00007 [cs.LG]
	(or arXiv:2601.00007v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2601.00007 arXiv-issued DOI via DataCite

Submission history

From: Nicholas Pape
[v1] Thu, 18 Dec 2025 20:03:32 UTC (78 KB)
