When Transformers Sing: Adapting SpectralKD for Text-Based Knowledge Distillation
towardsdatascience.com

While working on my knowledge distillation problem for intent classification, I hit a puzzling roadblock. My setup involved a teacher model, RoBERTa-large fine-tuned on my intent-classification data, and a student model that I was trying to train without losing too much accuracy relative to the teacher.
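For context, the objective in this kind of setup is usually the standard logit-distillation loss: a softened KL term against the teacher's predictions plus a hard cross-entropy term against the labels. Here is a minimal PyTorch sketch; the function name, temperature, and mixing weight are my own illustrative choices, not taken from the article.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard logit distillation: softened KL term + hard cross-entropy term."""
    # Soften both distributions with temperature T before comparing them.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary supervised loss on the ground-truth intent labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```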

I experimented with multiple mapping techniques: connecting every second teacher layer to a student layer, averaging two teacher layers into one, and even assigning custom weights (for example, 0.3 to layer l1 and 0.7 to layer l2). But no matter which combination I tried, the student's accuracy never matched the teacher's.
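A rough sketch of what those mapping strategies look like in code, assuming the student and teacher hidden sizes already match (or have been projected to match); the `mapping` format and function name are illustrative, not the article's actual implementation.

```python
import torch
import torch.nn.functional as F

def hidden_state_loss(student_hiddens, teacher_hiddens, mapping):
    """MSE between each student layer and a weighted combination of teacher layers.

    `mapping` has one entry per student layer, a list of (teacher_idx, weight) pairs:
        skip mapping:   [[(2, 1.0)], [(4, 1.0)], ...]
        averaging:      [[(1, 0.5), (2, 0.5)], ...]
        custom weights: [[(1, 0.3), (2, 0.7)], ...]
    """
    loss = 0.0
    for s_idx, combo in enumerate(mapping):
        # Build the teacher target as a weighted sum of the chosen teacher layers.
        target = sum(w * teacher_hiddens[t_idx] for t_idx, w in combo)
        loss = loss + F.mse_loss(student_hiddens[s_idx], target)
    return loss / len(mapping)
```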

That’s when I started exploring how to map the most informative teacher layers to my student model so that the student can maximize its performance. I wanted a way t…
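To make "most informative" concrete, here is one way to score teacher layers with a spectral measure in the spirit of the SpectralKD idea named in the title: take an FFT over each layer's hidden states and use the average spectral magnitude as a per-layer score. This is only a sketch under my own assumptions about the inputs; the article's actual scoring may differ.

```python
import torch

def spectral_layer_scores(teacher_hiddens):
    """Score each teacher layer by the spectral energy of its hidden states.

    teacher_hiddens: list of tensors, each of shape [batch, seq_len, hidden_dim].
    Returns one scalar per layer; higher is treated as more informative here.
    """
    scores = []
    for h in teacher_hiddens:
        # FFT along the sequence dimension, then average magnitude as a
        # crude measure of how much structure the layer carries.
        spectrum = torch.fft.fft(h.float(), dim=1)
        scores.append(spectrum.abs().mean().item())
    return scores

# The highest-scoring teacher layers would then be the ones mapped to the student.
```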
