Multi-Objective Reward and Preference Optimization: Theory and Algorithms
arxiv.org·1d

Computer Science > Machine Learning

arXiv:2512.10601 (cs)

Abstract: This thesis develops theoretical frameworks and algorithms that advance constrained reinforcement learning (RL) across control, preference learning, and alignment of large language models. The first contribution addresses constrained Markov Decision Processes (CMDPs) under the average-cost criterion through the Average-Constrained Policy Optimization (ACPO) algorithm. ACPO integrates sensitivity analysis with trust-region updates to ensure stable constraint handling, achieving state-of-the-art empirical performance with theoretical guarantees. Constrained RL is then extended to finite-horizon settings via e-COP, the first policy optimization method for ep…
