I post-trained a model to reliably roll a die

Discussed on Hacker News

Why LLMs always output 4 when asked to roll a die, and how exploration techniques in reinforcement learning fix this.
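To make the idea concrete, here is a minimal toy sketch in Python. It is not the post's actual training setup. It assumes a six-way softmax policy whose made-up starting logits favor face 4, and it uses an entropy bonus as one common exploration technique; the summary above does not say which technique is used. The function names (`softmax`, `entropy`, `greedy_face`, `entropy_ascent_step`) and the starting logits are all hypothetical.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; largest when the die is fair."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def greedy_face(logits):
    """Face (1-6) chosen by always taking the most likely outcome."""
    probs = softmax(logits)
    return probs.index(max(probs)) + 1

def entropy_ascent_step(logits, lr=0.5):
    """One gradient-ascent step on the policy entropy.

    For a softmax policy, dH/dz_i = -p_i * (log p_i + H).
    """
    probs = softmax(logits)
    h = entropy(probs)
    return [z - lr * p * (math.log(p) + h) for z, p in zip(logits, probs)]

# Assumed starting point: a policy that strongly prefers face 4.
initial_logits = [0.0, 0.0, 0.0, 3.0, 0.0, 0.0]

def train(logits, steps=500):
    """Repeatedly apply the entropy bonus on its own (no task reward)."""
    for _ in range(steps):
        logits = entropy_ascent_step(logits)
    return logits

if __name__ == "__main__":
    print("greedy face before:", greedy_face(initial_logits))
    trained = train(initial_logits)
    print("probabilities after:", [round(p, 3) for p in softmax(trained)])
```

In this toy setting the entropy term alone pushes the distribution toward uniform. A real post-training run would combine a term like this with a task reward and would sample from the model rather than update six logits directly.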