While working on a Knowledge Distillation problem for intent classification, I hit a puzzling roadblock. My setup involved a teacher model, RoBERTa-large (fine-tuned on my intent classification task), and a student model that I was trying to train without losing too much accuracy relative to the teacher.
I experimented with multiple mapping techniques: connecting every 2nd teacher layer to a student layer, averaging two teacher layers into one, and even assigning custom weights (e.g., 0.3 to layer 1 and 0.7 to layer 2). But no matter what combination I tried, the student’s accuracy never came close to the teacher’s.
That’s when I started exploring how to map the most informative teacher layers to my student model so that the student could maximize its performance. I wanted a way to quantify which layers of the teacher model truly matter for distillation.
In that search, I stumbled upon a fascinating paper, “SpectralKD: A Unified Framework for Interpreting and Distilling Vision Transformers via Spectral Analysis,” which tackles a similar problem in the image domain. The authors use spectral analysis to align the teacher and student models more intelligently.
Curious, I decided to adapt the idea to text data, and **BOOM!** It actually worked! For the first time, my student model started thinking almost like its teacher.
Source: Author
Here’s the layer intensity graph of my fine-tuned RoBERTa-large model. Based on the spectral insights, I selected layers 1–9 and 21–23 for my student model during knowledge distillation, the ones carrying the richest information.
I can’t share my dataset or code for confidentiality reasons, but I’ll walk you through how the paper’s image-based approach inspired my text-based adaptation, and how you can think about doing the same.
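Since I can’t share the real code, here’s a minimal, hypothetical sketch of what the layer-mapping distillation objective looked like in spirit. Everything here (the `TEACHER_LAYERS` list, the temperature `T`, the weight `alpha`, the helper name) is illustrative rather than my production setup, and it assumes the teacher’s hidden states have already been projected to the student’s hidden size.

```python
import torch
import torch.nn.functional as F

# Teacher layers picked from the spectral-intensity graph (1-9 and 21-23),
# mapped one-to-one onto a 12-layer student.
TEACHER_LAYERS = list(range(1, 10)) + [21, 22, 23]  # 12 teacher layers

def distillation_loss(teacher_hidden, student_hidden,
                      teacher_logits, student_logits,
                      labels, T=2.0, alpha=0.5):
    """hidden args: lists of (batch, seq_len, hidden) tensors, one per layer."""
    # 1) Align each selected teacher layer with its student counterpart.
    layer_loss = sum(
        F.mse_loss(student_hidden[s], teacher_hidden[t])
        for s, t in enumerate(TEACHER_LAYERS)
    ) / len(TEACHER_LAYERS)

    # 2) Soft-label distillation on the logits (Hinton et al., 2015).
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

    # 3) Plain cross-entropy on the gold intent labels.
    ce_loss = F.cross_entropy(student_logits, labels)
    return layer_loss + alpha * kd_loss + (1 - alpha) * ce_loss
```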
Behind the Scenes: How FFT Reveals a Model’s Spectral Soul
So, let’s start with spectral intensity, and slowly dive into the real magician here: the Fast Fourier Transform (FFT).
In the SpectralKD paper, the authors introduce a framework that lets us see inside Vision Transformers (ViTs): not just what they are predicting, but how information flows through their layers. Instead of relying on intuition or visualisation, they use spectral analysis, a way to measure the frequency richness of the model’s internal representations.
Imagine each Transformer layer as a musician in an orchestra: some layers play high notes (fine details), while others play low notes (broad features). The FFT lets us listen to each player separately and pick out who carries the strongest melody, i.e., the most information-rich signals.
Source: Author
Step 1: Feature maps, the raw material
Each ViT layer produces a feature map: a tensor X of shape (B, C, H, W), where B is the batch size, C is the number of channels, and H, W are the spatial height and width.
Step 2: Applying the Fourier Transform
The authors apply a 1-dimensional FFT along the channel dimension to translate these real-valued activations into the frequency domain: F(X)=FFT(X)
This means: For every spatial location (b, h, w), a 1D FFT is computed across all channels. The result is a complex-valued tensor (since FFT outputs real + imaginary parts). F(X) therefore tells us how much of each frequency is present in that layer’s representation.
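As a concrete, if simplified, sketch in PyTorch (the shapes here are made up, and `torch.fft.fft` is just one way to compute it):

```python
import torch

X = torch.randn(8, 384, 14, 14)   # (B, C, H, W): a made-up ViT feature map
F_X = torch.fft.fft(X, dim=1)     # 1D FFT across the channel dimension C
print(F_X.dtype)                  # torch.complex64: real + imaginary parts
```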
And if you’re wondering, “Why FFT though?” — hold that thought. Because later in this blog, we’re going to uncover exactly why FFT is the perfect tool to measure a model’s inner intensity.
Step 3: Measuring frequency strength
The strength at each position is the magnitude of the complex FFT output:

I(X) = √(Re(F(X))² + Im(F(X))²)

where Re(F(X)) is the real part and Im(F(X)) is the imaginary part.
Step 4: Averaging across the map
Now we want to summarize this intensity across all positions in the layer. For each channel c, average over the batch and spatial positions:

Ī(c) = (1/(B·H·W)) Σ₍b,h,w₎ I(X)(b, c, h, w)

This step tells us the average intensity of a single channel. Then simply average Ī(c) over all C channels. Voilà! Now you have the spectral intensity of a single layer of the Vision Transformer.
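Putting Steps 2–4 together, a self-contained sketch (again with made-up shapes) might look like this:

```python
import torch

X = torch.randn(8, 384, 14, 14)              # (B, C, H, W) feature map
F_X = torch.fft.fft(X, dim=1)                # Step 2: FFT across channels
intensity = F_X.abs()                        # Step 3: sqrt(Re^2 + Im^2)
per_channel = intensity.mean(dim=(0, 2, 3))  # Step 4: average over (b, h, w)
layer_intensity = per_channel.mean().item()  # then over channels: one scalar
print(f"spectral intensity of this layer: {layer_intensity:.4f}")
```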
Peeking into the Frequency Realm: The Fourier Lens of SpectralKD
Let’s look into the Fast Fourier Transform:

Xₖ = Σₙ₌₀ᴺ⁻¹ xₙ · e⁻ʲ²πᵏⁿ/ᴺ

Here, xₙ is the input sequence (your signal, feature, or activation pattern), Xₖ is the component at frequency index k, and N is the number of points in the sequence (i.e., the number of channels or features).
Each term e⁻ʲ²πᵏⁿ/ᴺ acts as a rotating phasor, a tiny complex wave spinning through the signal space, and together, they form one of the most beautiful ideas in signal processing.
Source: Author (Here, a rotating phasor e⁻ʲ²πᵏⁿ/ᴺ is multiplied by g(t) in the complex plane)
Source: Author (Averaging all the points in the complex plane gives the center of mass of the phasor path, which peaks only at a specific frequency k; in the case above, k = 3)
OMG! What just happened here? Let me break it down.
When you multiply your hidden activations xₙ (say, across channels or feature dimensions) by this phasor, you’re essentially asking:
“Hey, layer, how much of the k-th type of variation do you contain in your representations?”
Each frequency k corresponds to a distinct pattern scale across the feature dimensions.
Lower k values capture broad, smooth semantic structures (like topic-level context), while higher k values capture rapid, fine-grained variations (like token-level nuances or syntactic signals).
Now here’s the fun part: if a layer resonates with a particular frequency pattern, the multiplication by the rotating phasor stays aligned, and the sum in the Fourier formula produces a strong response for that k.
If not, the rotations cancel out, meaning that frequency doesn’t play a big role in that layer’s representation.
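You can watch this cancellation happen in a ten-line NumPy experiment: a signal with exactly three cycles lights up only the k = 3 bin (and its mirror image), matching the k = 3 peak in the figure above.

```python
import numpy as np

N = 64
n = np.arange(N)
x = np.cos(2 * np.pi * 3 * n / N)   # exactly 3 cycles across N points
mag = np.abs(np.fft.fft(x))

print(mag.argmax())                 # 3: that phasor stays aligned and adds up
print(mag[:8].round(2))             # every other bin sums to ~0: cancellation
```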
So, the Fourier Transform isn’t adding anything new; it simply reveals how our layer encodes information across different scales of abstraction.
It’s like zooming out and realizing:
- Some layers hum quietly with smooth, conceptual meanings (low frequencies),
- Others buzz with sharp, detailed interactions between tokens (high frequencies).
The FFT basically turns a layer’s hidden states into a frequency fingerprint — a map of what kinds of information that layer is focusing on.
And that’s exactly what SpectralKD uses to figure out which layers are actually doing the heavy lifting during knowledge distillation.
If you’d like more visual intuition for the Fourier transform, go through the 3Blue1Brown video, “But what is the Fourier Transform? A visual introduction.”
From Vision to Language: How Spectral Intensity Guided My Intent Classifier
Source: Author
Let a layer activation tensor be:

X ∈ Rᴺ ˣ ᴸ ˣ ᴴ
where:
- N = number of samples (batch size)
- L = sequence length (number of tokens/time steps)
- H = hidden dimension (number of channels/features produced by the layer)
Each sample i has an activation matrix Xᵢ ∈ Rᴸ ˣ ᴴ (sequence positions × hidden features).
Now, just as before, you can compute the FFT of each Xᵢ, measure the frequency magnitude from its real and imaginary components, average across the channel bins, and then aggregate across positions and samples for each layer.
Frequency magnitude:

Mᵢ(l, k) = √(Re(Fᵢ(l, k))² + Im(Fᵢ(l, k))²), with Fᵢ = FFT(Xᵢ) taken along the hidden dimension

Frequency across channels (keeping only the first K bins):

Sᵢ(l) = (1/K) Σₖ₌₀ᴷ⁻¹ Mᵢ(l, k)

Frequency across a layer:

I = (1/(N·L)) Σᵢ Σₗ Sᵢ(l)

Here, K is the number of frequency bins retained.
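Here’s a hedged sketch of how this can be computed in practice. The axis conventions (FFT along the hidden dimension, mirroring the paper’s channel-wise FFT) and the value of K are my own adaptation choices, not the paper’s exact recipe:

```python
import torch

def layer_spectral_intensity(X: torch.Tensor, K: int = 64) -> float:
    """X: (N, L, H) hidden states of one layer, stacked over N samples."""
    M = torch.fft.fft(X, dim=2).abs()   # magnitude sqrt(Re^2 + Im^2), (N, L, H)
    S = M[:, :, :K].mean(dim=2)         # keep the first K bins, average: (N, L)
    return S.mean().item()              # then average over positions and samples

# e.g. rank all 24 layers of a RoBERTa-large-sized model by intensity
# (random tensors stand in for real hidden states here):
hidden = [torch.randn(16, 128, 1024) for _ in range(24)]
scores = {i + 1: layer_spectral_intensity(h) for i, h in enumerate(hidden)}
```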
Conclusion
Their analysis shows two major insights:
- Not all layers contribute equally. In uniform transformer architectures, only a few early and final layers show strong spectral activity, the true “hotspots” of information flow.
- Different transformer types, similar melodies. Despite architectural variations, both hierarchical and uniform transformers share surprisingly similar spectral patterns, hinting at a universal way these models learn and represent knowledge.
Building on these findings, SpectralKD introduces a simple, parameter-free knowledge distillation (KD) strategy. By selectively aligning the spectral behavior of early and final layers between a teacher and a student model, the student learns to mimic the teacher’s spectral signature, even in intermediate layers that were never explicitly aligned.
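In spirit (this is my reading of the strategy, not the paper’s reference implementation), the spectral alignment term can be as simple as matching the FFT magnitudes of a selected teacher/student layer pair:

```python
import torch
import torch.nn.functional as F

def spectral_alignment_loss(student_feat: torch.Tensor,
                            teacher_feat: torch.Tensor) -> torch.Tensor:
    """Compare FFT magnitudes of two same-shaped feature tensors, bin by bin."""
    s_mag = torch.fft.fft(student_feat, dim=-1).abs()
    t_mag = torch.fft.fft(teacher_feat, dim=-1).abs()
    return F.mse_loss(s_mag, t_mag)
```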
The results in the paper are striking: the distilled student (DeiT-Tiny) doesn’t just perform well on benchmarks like ImageNet-1K; it also learns to think spectrally like its teacher, capturing both local and global information with remarkable fidelity.
Ultimately, SpectralKD bridges interpretability and distillation, offering a fresh way to visualize what happens inside transformers during learning. It opens a new line of research the authors call “distillation dynamics”: a journey into how knowledge itself flows, oscillates, and harmonizes between teacher and student networks.
References
Core Spectral & Transformer Foundations
- Vaswani, A. et al. Attention Is All You Need. NeurIPS, 2017.
- Dosovitskiy, A. et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2020.
- Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., & Dosovitskiy, A. Do Vision Transformers See Like Convolutional Neural Networks? NeurIPS, 2021.
- Han, K. et al. A Survey on Vision Transformer. IEEE TPAMI, 2022.
Interpretability & Spectral Analysis
- Chefer, H., Gur, S., & Wolf, L. Transformer Interpretability Beyond Attention Visualization. CVPR, 2021.
- Yeh, C. et al. AttentionViz: A Global View of Transformer Attention. IEEE TVCG, 2023.
- Zeng, J. et al. Peeling Back the Layers: Interpreting the Storytelling of ViT. ACM Multimedia, 2024.
Knowledge Distillation & Model Compression
- Hinton, G. et al. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531, 2015.
- Phuong, M., & Lampert, C. Towards Understanding Knowledge Distillation. ICML, 2019.
- Park, W. et al. Relational Knowledge Distillation. CVPR, 2019.
- Chandrasegaran, K. et al. Revisiting Label Smoothing and Knowledge Distillation Compatibility: What Was Missing? ICML, 2022.
- Huang, T. et al. Knowledge Distillation from a Stronger Teacher. NeurIPS, 2022.
- Pham, C. et al. Frequency Attention for Knowledge Distillation. WACV, 2024.
- Fan, J. et al. ScaleKD: Strong Vision Transformers Could Be Excellent Teachers. arXiv preprint arXiv:2411.06786, 2024.
- Son, S. et al. The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers. ECCV, 2024.
SpectralKD Core Paper
- SpectralKD: A Unified Framework for Interpreting and Distilling Vision Transformers via Spectral Analysis. arXiv preprint, 2024.