BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents
Stepwise group-based RL is an attractive way to train long-horizon LLM agents without a learned critic: it reuses multiple sampled rollouts to estimate local advantages. Its weakness is less visible but more fundamental: every group-relative estimator assumes that the steps it compares are equivalent for credit assignment. We show that current agentic variants violate this assumption through a state-action credit mismatch. The observation-hash...
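To make the equivalence assumption concrete, here is a minimal sketch of a generic group-relative advantage estimator. It is not the method from this paper. It assumes a mean/standard-deviation normalization within each group, which the abstract does not specify, and the names `group_relative_advantages` and `group_key` are hypothetical: `group_key` stands in for whatever criterion decides which steps are compared against each other.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(steps, eps=1e-8):
    """Estimate per-step advantages without a learned critic.

    steps: list of (group_key, step_return) pairs, where group_key is a
           hypothetical label deciding which steps are compared.
    Returns a list of advantages in the same order as `steps`.
    """
    # Collect the returns of all steps that share a group key.
    groups = defaultdict(list)
    for group_key, step_return in steps:
        groups[group_key].append(step_return)

    # Per-group baseline (mean) and scale (population standard deviation).
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    # Each step is scored only relative to the other steps in its group
    # (assumed normalization: subtract the group mean, divide by group std).
    advantages = []
    for group_key, step_return in steps:
        baseline, scale = stats[group_key]
        advantages.append((step_return - baseline) / (scale + eps))
    return advantages


# Two groups of sampled steps: within "s0" the returns differ,
# within "s1" they are identical, so no step there is preferred.
example = [("s0", 1.0), ("s0", 0.0), ("s1", 2.0), ("s1", 2.0)]
example_advantages = group_relative_advantages(example)
```

The baseline is only meaningful if every step sharing a `group_key` is truly comparable. If steps that differ in a way that matters for credit are placed in the same group, the group mean is no longer a valid reference point for any of them.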