Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently,...
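
As a rough illustration of the logit-level alignment described above, the following minimal sketch computes a mean per-token KL divergence between a target distribution and the model's own distribution over a toy rollout. The function names, the toy logits, and the choice of direction, KL(target || student), are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def distillation_loss(target_logits, student_logits):
    """Mean per-token KL(target || student) across one rollout.

    The direction of the divergence is an assumption of this sketch.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(target_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

# Hypothetical toy rollout: 2 token positions, vocabulary of size 3.
target_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
student_logits = [[1.0, 1.0, 0.2], [0.2, 1.5, 0.3]]

loss = distillation_loss(target_logits, student_logits)
print(f"mean per-token KL: {loss:.4f}")
```

The result is a single scalar per rollout. It measures how far the two distributions are apart but does not indicate which reasoning step was wrong or how to correct it, which is the limitation the abstract points to.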
