We improved Composer by scaling training, generating more complex RL environments, and introducing new learning methods. For example, we use text feedback during RL to learn faster by assigning credit in rollouts spanning hundreds of thousands of tokens.