In the previous section, I described an apparent transition during training: volatile representations early on, followed by a phase of rapid structural alignment, and finally a stabilization period where loss improvements slow but internal consistency increases. Assuming this pattern is not an artifact, the next question becomes more fundamental: What changes in the learning dynamics cause this transition to occur? Rather than treating training as a smooth, monotonic process, it may be more accurate to view it as a sequence of qualitatively different regimes.
A Possible Phase Transition in Learning
One way to interpret the observed behavior is as a phase transition in representation space. Early in training, parameter updates appear dominated by large gradients responding to easy-to-exploit statistical regularities. These updates reshape embeddings aggressively, but not coherently. To quantify this intuition, I extended the training loop to explicitly track gradient magnitude and representation drift:
import torch

def flatten_params(tensor):
    # Flatten a parameter tensor into a 1-D vector for norm comparisons
    return tensor.view(-1)

prev_embedding = None

# model, optimizer, criterion, inputs, targets, and num_epochs
# are the same objects set up in the earlier training code.
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()

    # Gradient magnitude on the embedding matrix, measured before the update
    grad_norm = model.encoder.weight.grad.norm().item()
    optimizer.step()

    with torch.no_grad():
        current_embedding = model.encoder.weight.clone()
        emb_norm = current_embedding.norm().item()

        # Drift: how far the embedding matrix moved since the previous epoch
        if prev_embedding is not None:
            drift = torch.norm(
                flatten_params(current_embedding) -
                flatten_params(prev_embedding)
            ).item()
        else:
            drift = 0.0

    print(
        f"Epoch {epoch} | "
        f"Loss: {loss.item():.4f} | "
        f"GradNorm: {grad_norm:.2f} | "
        f"EmbNorm: {emb_norm:.2f} | "
        f"Drift: {drift:.4f}"
    )

    prev_embedding = current_embedding
What stood out:
• Gradient norms were largest early, even when loss reductions were modest
• Embedding drift was extreme during early epochs
• Drift dropped sharply after a certain point, even though gradients remained non-zero
This suggests that early gradients primarily drive exploration of parameter space rather than refinement of stable structure.
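One cheap way to make that exploration-versus-refinement distinction concrete is a gradient-to-drift ratio: how much gradient magnitude it takes to produce a unit of representational movement. A minimal sketch, added inside the same loop right after drift is computed and reusing the grad_norm and drift values already logged there (the eps guard is my addition, to avoid dividing by zero on the first epoch):

# Inside the training loop, after drift has been computed
eps = 1e-8  # guard against division by zero on the first epoch
grad_to_drift = grad_norm / (drift + eps)
# A ratio that climbs late in training means gradients are no longer
# translating into representational movement.
print(f"Epoch {epoch} | Grad/Drift ratio: {grad_to_drift:.2f}")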
Optimization vs. Organization
This observation hints at an important distinction that loss alone does not capture:
• Optimization: minimizing task error
• Organization: imposing structure and consistency on internal representations
Early training is dominated by optimization pressure. Later training appears to emphasize organization, even when the task objective changes little. To probe this more directly, I tracked pairwise cosine similarity between randomly sampled token embeddings across epochs:
import random
import torch.nn.functional as F

def sample_cosine_stats(embeddings, num_samples=100):
    # Mean cosine similarity between consecutive members of a random
    # sample of embedding rows -- a cheap proxy for pairwise statistics
    indices = random.sample(range(embeddings.size(0)), num_samples)
    cosines = []
    for i in range(len(indices) - 1):
        a = embeddings[indices[i]]
        b = embeddings[indices[i + 1]]
        cosines.append(F.cosine_similarity(a, b, dim=0).item())
    return sum(cosines) / len(cosines)

# Inside the training loop, after optimizer.step()
with torch.no_grad():
    cosine_mean = sample_cosine_stats(model.encoder.weight)
print(f"Epoch {epoch} | Mean Pairwise Cosine: {cosine_mean:.4f}")
During early epochs, cosine statistics fluctuated wildly. Mid-training, they shifted rapidly toward consistent ranges, indicating emerging geometry. Late training showed minimal variance, even when loss had largely saturated. This reinforces the idea that internal geometry continues evolving long after task performance stabilizes.
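The helper above only reports a mean, while the claim is really about dispersion. A small variation, sketched here under the same setup, also tracks the standard deviation over fully random pairs, which makes the "wild fluctuation early, minimal variance late" contrast directly measurable (sample_cosine_mean_std and num_pairs are names I introduce here, not part of the earlier code):

import random
import torch
import torch.nn.functional as F

def sample_cosine_mean_std(embeddings, num_pairs=500):
    # Mean and standard deviation of cosine similarity over random embedding pairs
    n = embeddings.size(0)
    cosines = []
    for _ in range(num_pairs):
        i, j = random.sample(range(n), 2)  # two distinct rows per pair
        cosines.append(F.cosine_similarity(embeddings[i], embeddings[j], dim=0))
    cosines = torch.stack(cosines)
    return cosines.mean().item(), cosines.std().item()

# Inside the training loop
with torch.no_grad():
    cos_mean, cos_std = sample_cosine_mean_std(model.encoder.weight)
print(f"Epoch {epoch} | Cosine mean: {cos_mean:.4f} | Cosine std: {cos_std:.4f}")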
Why Representations Stabilize Even When Loss Does Not
One puzzling outcome was that representation similarity between epochs continued increasing even when loss improvement was negligible. This suggests that gradients were still shaping the model—but along dimensions invisible to the objective. A plausible explanation is that, once the model enters a low-loss basin, gradient descent mostly moves parameters within an equivalence class of solutions. Outputs remain similar, but internal redundancy decreases and representations become smoother. This can be approximated by tracking singular value decay of the embedding matrix:
# Inside the training loop, again under no_grad
with torch.no_grad():
    u, s, vh = torch.linalg.svd(model.encoder.weight, full_matrices=False)
    # Fraction of total spectral energy carried by each singular value
    spectral_energy = (s / s.sum()).cpu()

print(
    f"Epoch {epoch} | "
    f"Top-5 Singular Energy: {spectral_energy[:5].sum().item():.4f}"
)
Late training often showed:
• Increasing concentration of spectral energy
• Reduced effective rank (quantified in the sketch below)
• More anisotropic embedding spaces
These changes align with emergent structure rather than improved prediction accuracy.
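To attach a number to "reduced effective rank", one standard choice is the exponential of the spectral entropy of the singular values. A minimal sketch, building on the same embedding matrix and using an effective_rank helper I introduce here:

def effective_rank(weight):
    # Effective rank = exp(spectral entropy) of the singular values
    s = torch.linalg.svdvals(weight)
    p = s / s.sum()                              # normalized spectrum
    entropy = -(p * torch.log(p + 1e-12)).sum()  # small constant avoids log(0)
    return torch.exp(entropy).item()

with torch.no_grad():
    print(f"Epoch {epoch} | Effective rank: {effective_rank(model.encoder.weight):.2f}")

A value that falls over training corresponds to the spectral-energy concentration reported above.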
Rethinking Overfitting as a Developmental Stage
This lens reframes overfitting. Instead of a terminal failure mode, it may act as a developmental constraint that forces the model to commit to specific distinctions. Empirically, models that were heavily regularized from the start showed:
• Lower early drift
• Weaker mid-training clustering
• Less stable late-stage representations
Conversely, temporarily allowing overfitting seemed to accelerate representational alignment, suggesting that structure must exist before it can be generalized.
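One way to operationalize "temporarily allowing overfitting" is to stage regularization: run the first few epochs without weight decay, then switch it on once early drift has settled. A minimal sketch, assuming an AdamW optimizer; warmup_epochs and weight_decay_late are illustrative values I chose here, not tuned:

warmup_epochs = 5         # illustrative, not tuned
weight_decay_late = 1e-2  # illustrative, not tuned

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)

for epoch in range(num_epochs):
    if epoch == warmup_epochs:
        # Switch regularization on after the unregularized warm-up phase
        for group in optimizer.param_groups:
            group["weight_decay"] = weight_decay_late
    # ... forward / backward / step and logging as in the loop above ...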
Signals That May Matter More Than Loss
If loss is insufficient for understanding learning dynamics, other metrics may be more revealing:
• Representation drift between epochs
• Cosine similarity convergence
• Effective rank or spectral entropy
• Gradient-to-drift ratio
• Sensitivity of embeddings to noise perturbations
For example, injecting small Gaussian noise into inputs revealed that late-stage representations were significantly more invariant than early ones, even at identical loss levels.
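That noise-perturbation check can be written down as a relative-change measure. A minimal sketch; noise_sensitivity and encode_fn are names I introduce here, the full model's output is used as a stand-in for the representation of interest, and Gaussian noise only makes sense if the inputs are continuous (for token ids, perturb the embeddings instead):

def noise_sensitivity(encode_fn, inputs, sigma=0.01):
    # Relative change in representations under small Gaussian input noise.
    # encode_fn: whatever returns the representation of interest
    # (here the full model is used as a stand-in).
    with torch.no_grad():
        clean = encode_fn(inputs)
        noisy = encode_fn(inputs + sigma * torch.randn_like(inputs))
    return ((noisy - clean).norm() / (clean.norm() + 1e-8)).item()

sensitivity = noise_sensitivity(model, inputs)
print(f"Epoch {epoch} | Noise sensitivity: {sensitivity:.4f}")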
A Lingering Uncertainty
Despite these findings, an uncomfortable possibility remains: What if stabilization reflects numerical inertia rather than semantic abstraction? Without transfer tasks or probing classifiers, it is difficult to disambiguate meaningful structure from converged parameterization. Representation stability is likely necessary for abstraction but not sufficient.
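A linear probe is the cheapest way to start closing that gap: freeze the embeddings and train a single linear layer to predict some auxiliary property. The sketch below is hypothetical in its inputs; probe_labels and num_probe_classes stand for whatever labelled attribute is available (part-of-speech tags, frequency bins, and so on):

import torch
import torch.nn as nn

probe_features = model.encoder.weight.detach()  # frozen representations

probe = nn.Linear(probe_features.size(1), num_probe_classes)  # num_probe_classes: hypothetical
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
probe_loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    probe_opt.zero_grad()
    loss = probe_loss_fn(probe(probe_features), probe_labels)  # probe_labels: hypothetical
    loss.backward()
    probe_opt.step()

probe_acc = (probe(probe_features).argmax(dim=1) == probe_labels).float().mean().item()
print(f"Probe accuracy: {probe_acc:.4f}")

If probe accuracy keeps rising across epochs while loss stays flat, the late-stage drift is plausibly doing semantic work; if it does not, the stabilization may indeed be numerical inertia.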
Reframing the Core Question
At this point, the question can be stated more precisely: Is training best understood as minimizing loss, or as guiding representations through unstable exploratory regimes toward constrained, reusable geometries?
If the latter is even partially true, then training dynamics, not just end metrics, deserve closer attention. I still hesitate to claim novelty. These ideas likely overlap with concepts such as neural collapse, spectral bias, or information bottlenecks. What feels distinct is seeing them not as isolated effects but as successive phases of a single developmental process. That leaves the question open, but sharper: if abstraction only emerges after instability, how many of our training practices are unintentionally optimized to prevent the very conditions that make abstraction possible? For now, this remains less a conclusion than a suspicion, but one increasingly supported by the behavior of the models themselves.