Credit risk modeling has a tricky secret. Organizations deploy models that achieve 98% accuracy in validation, then watch them quietly degrade in production. The team calls it “concept drift” and moves on. But what if this isn’t a mysterious phenomenon — what if it’s a predictable consequence of how we optimize?
I started asking this question after watching another production model fail. The answer led somewhere unexpected: the geometry we use for optimization determines whether models stay stable as distributions shift. Not the data. Not the hyperparameters. The space itself.
I realized that credit risk is fundamentally a ranking problem, not a classification problem. You don’t need to predict “default” or “no default” with 98% accuracy. You need to order borrowers by risk: Is Borrower A riskier than Borrower B? If the economy deteriorates, who defaults first?
Standard approaches miss this completely. Here’s what gradient-boosted trees (XGBoost, the field’s favorite tool) actually achieve on the Freddie Mac Single-Family Loan-Level Dataset (692,640 loans spanning 1999–2023):
- Accuracy: 98.7% ← looks impressive
- AUC (ranking ability): 60.7% ← barely better than random
- 12 months later: 96.6% accuracy, but ranking degrades
- 36 months later: 93.2% accuracy, AUC is 66.7% (essentially useless)
XGBoost achieves impressive accuracy but fails at the actual task: ordering risk. And it degrades predictably.
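To see how accuracy and AUC can diverge, here is a minimal, self-contained illustration (pure Python, synthetic data; the numbers are made up for the example and are not taken from the Freddie Mac results):

```python
import random

def auc(labels, scores):
    """Rank-based AUC: probability a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
# Imbalanced data: 2% defaults, as in a typical credit portfolio.
labels = [1] * 20 + [0] * 980
# A "model" whose risk scores are uninformative noise.
scores = [random.random() for _ in labels]
# Predicting "no default" for everyone is 98% accurate...
preds = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"accuracy: {accuracy:.3f}")            # 0.980
print(f"AUC:      {auc(labels, scores):.3f}")  # ~0.5: no ranking ability
```

High accuracy with near-random AUC is exactly the failure mode described above: the headline metric looks fine while the ordering of borrowers is scrambled.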
Now compare this to what I’ve developed (presented in a paper accepted at IEEE DSA 2025):
- Initial AUC: 80.3%
- 12 months later: 76.4%
- 36 months later: 69.7%
- 60 months later: 69.7%
The difference: XGBoost loses 32 AUC points over 60 months. Our approach loses just 10.6. AUC (Area Under the Curve) is the metric that tells us how well the trained algorithm will rank risk on unseen data.
Why does this happen? It comes down to something unexpected: the geometry of optimization itself.
Why This Matters (Even If You’re Not in Finance)
This isn’t just about credit scores. Any system where ranking matters more than exact predictions faces this problem:
- Medical risk stratification — Who needs urgent care first?
- Customer churn prediction — Which customers should we focus retention efforts on?
- Content recommendation — What should we show next?
- Fraud detection — Which transactions merit human review?
- Supply chain prioritization — Which disruptions to address first?
When your context changes gradually — and whose doesn’t? — accuracy metrics lie to you. A model can maintain 95% accuracy while completely scrambling the order of who’s actually at highest risk.
That’s not a model degradation problem. That’s an optimization problem.
What Physics Teaches Us About Stability
Think about GPS navigation. If you only optimize for “shortest current route,” you might guide someone onto a road that’s about to close. But if you preserve the structure of how traffic flows — the relationships between routes — you can maintain good guidance even as conditions change. That’s what we need for credit models. But how do you preserve structure?
NASA has faced this exact problem for years. When simulating planetary orbits over millions of years, standard computational methods make planets slowly drift — not because of physics, but because of accumulated numerical errors. Mercury gradually spirals into the Sun. Jupiter drifts outward. They solved this with symplectic integrators: algorithms that preserve the geometric structure of the system. The orbits stay stable because the method respects what physicists call “phase space volume” — it maintains the relationships between positions and velocities.
Now here’s the surprising part: credit risk has a similar structure.
The Geometry of Rankings
Standard gradient descent optimizes in Euclidean space. It finds local minima for your training distribution. But Euclidean geometry doesn’t preserve relative orderings when distributions shift.
What does?
Symplectic manifolds.
In Hamiltonian mechanics (a formalism used in physics), conservative systems (no energy loss) evolve on symplectic manifolds — spaces with a 2-form structure that preserves phase space volume (Liouville’s theorem).
Standard symplectic 2-form: ω = Σ_i dq_i ∧ dp_i
In this phase space, symplectic transformations preserve relative distances. Not absolute positions, but orderings. Exactly what we need for ranking under distribution shift. When you simulate a frictionless pendulum using standard integration methods, energy drifts. The pendulum in Figure 1 slowly speeds up or slows down — not because of physics, but because of numerical approximation. Symplectic integrators don’t have this problem because they preserve the Hamiltonian structure exactly. The same principle can be applied to neural network optimization.
Figure 1. The frictionless pendulum is the most basic example of Hamiltonian mechanics. The pendulum has no air friction, which would dissipate energy; the Hamiltonian formalism applies to conservative (non-dissipative) systems, where energy is conserved. The image on the left shows the trajectory of the pendulum in phase space, represented by the velocity and the angle (central image). Image by author.
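The energy drift in Figure 1 is easy to reproduce. The sketch below (plain Python; step count and step size are illustrative) integrates the frictionless pendulum H(q, p) = p²/2 − cos(q) with both explicit Euler and symplectic Euler, then compares the final energy error:

```python
import math

def simulate(steps=10000, dt=0.01, symplectic=True):
    """Frictionless pendulum with H(q, p) = p**2/2 - cos(q)."""
    q, p = 1.0, 0.0  # initial angle (rad) and momentum
    for _ in range(steps):
        if symplectic:
            p -= dt * math.sin(q)   # update momentum from the current position...
            q += dt * p             # ...then update position with the NEW momentum
        else:
            q_new = q + dt * p      # explicit Euler: both updates use
            p -= dt * math.sin(q)   # the OLD state
            q = q_new
    return p * p / 2 - math.cos(q)  # final total energy

E0 = -math.cos(1.0)                 # initial energy (p = 0)
drift_euler = abs(simulate(symplectic=False) - E0)
drift_sympl = abs(simulate(symplectic=True) - E0)
print(drift_euler, drift_sympl)     # explicit Euler drifts far more
```

The only difference between the two branches is whether the position update sees the updated momentum, yet that single coupling is what keeps the symplectic trajectory’s energy bounded.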
Protein folding simulations face the same problem. You’re modeling thousands of atoms interacting over microseconds to milliseconds — billions of integration steps. Standard integrators accumulate energy: molecules heat up artificially, bonds break that shouldn’t, the simulation explodes.
Figure 2: Equivalence between the Hamiltonian in physical systems and its application in NN optimization spaces. Position q corresponds to the NN parameters θ, and the momentum vector p corresponds to the difference between consecutive parameter states. Although we can call this “physics inspiration”, it is applied differential geometry: symplectic forms, Liouville’s theorem, structure-preserving integration. But the Hamiltonian analogy works better for outreach purposes. Image by author.
The Implementation: Structure-Preserving Optimization
Here’s what I actually did:
Hamiltonian Framework for Neural Networks
I reformulated neural network training as a Hamiltonian system:
The Hamiltonian for mechanical systems: H(q, p) = T(p) + V(q)
In mechanical systems, T(p) is the kinetic energy term and V(q) is the potential energy. In this analogy, T(p) represents the cost of changing the model parameters, and V(q) represents the loss function of the current model state.
Symplectic Euler optimizer (not Adam/SGD):
Instead of Adam or SGD, I use symplectic integration. The symplectic Euler method for a Hamiltonian system with position q and momentum p is:

p_{t+1} = p_t − Δt · ∂H/∂q(q_t)
q_{t+1} = q_t + Δt · ∂H/∂p(p_{t+1})
Where:
- H is the Hamiltonian (energy function derived from the loss)
- Δt is the time step (analogous to learning rate)
- q are the network weights (position coordinates), and
- p are momentum variables (velocity coordinates)
Notice that p_{t+1} appears in both updates. This coupling is important — it’s what preserves the symplectic structure. This isn’t just momentum; it’s structure-preserving integration.
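As a concrete sketch, here is the coupled update applied to a toy one-parameter problem. The quadratic “loss” and all names are illustrative, not the paper’s implementation; note that without any damping the conservative dynamics oscillate around the minimum rather than converging to it:

```python
# Symplectic Euler for H(q, p) = T(p) + V(q), with T(p) = p**2/2 and V(q)
# standing in for the training loss. Toy example, not the paper's code.

def grad_V(q):
    """Gradient of a toy loss V(q) = 0.5*(q - 3)**2 (stand-in for dLoss/dtheta)."""
    return q - 3.0

def symplectic_euler_step(q, p, dt):
    p_next = p - dt * grad_V(q)   # momentum update uses the current position...
    q_next = q + dt * p_next      # ...position update uses the NEW momentum
    return q_next, p_next         # this coupling preserves the symplectic 2-form

q, p = 0.0, 0.0
for _ in range(2000):
    q, p = symplectic_euler_step(q, p, dt=0.05)

E = p * p / 2 + 0.5 * (q - 3.0) ** 2
print(round(E, 2))  # stays close to the initial energy of 4.5: bounded, no drift
```

If the position update used p_t instead of p_{t+1}, this would be plain explicit Euler and the “energy” would grow without bound, which is the optimization analogue of Mercury spiraling into the Sun.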
Hamiltonian-constrained loss
In addition, I define a loss based on the Hamiltonian formalism:

L(θ) = L_base(θ) + λ · R(θ)
Where:
- L_base(θ) is the binary cross-entropy loss
- R(θ) is the regularization term (an L2 penalty on the weights), and
- λ is the regularization coefficient
The regularization term penalizes deviations from energy conservation, constraining optimization to low-dimensional manifolds in parameter space.
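A minimal sketch of this loss in Python (variable names and the λ value are illustrative, not the paper’s exact configuration):

```python
import math

def hamiltonian_loss(theta, y_true, y_prob, lam=1e-3):
    """L(theta) = L_base(theta) + lam * R(theta):
    binary cross-entropy plus an L2 penalty on the weights."""
    eps = 1e-12  # guards log(0)
    bce = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
               for y, p in zip(y_true, y_prob)) / len(y_true)
    l2 = sum(w * w for w in theta)   # R(theta): L2 penalty on the weights
    return bce + lam * l2            # lam trades data fit vs. regularization

loss = hamiltonian_loss(theta=[0.5, -0.2], y_true=[1, 0], y_prob=[0.9, 0.1])
print(round(loss, 4))  # 0.1057
```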
How It Works
The mechanism has three components:
- Symplectic structure → volume preservation → bounded parameter exploration
- Hamiltonian constraint → energy conservation → stable long-term dynamics
- Coupled updates → preserves geometric structure relevant for ranking
This structure is captured in the following algorithm.
Figure 3: The algorithm applies both the momentum update and the Hamiltonian-constrained optimization.
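Putting the two pieces together, a toy version of the loop might look like the following. This is a sketch under stated assumptions: a one-feature logistic model, an illustrative λ, and a mild damping factor added so the toy example settles (the paper’s actual algorithm is the one shown in Figure 3):

```python
import math

def grad_loss(w, xs, ys, lam=1e-3):
    """d/dw of binary cross-entropy + lam * w**2 (the V(q) term of the sketch)."""
    g = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-w * x))
        g += (p - y) * x
    return g / len(xs) + 2 * lam * w

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]            # positive x => positive class

w, p = 0.0, 0.0                     # position (weight) and momentum
dt, friction = 0.1, 0.9             # damping is an addition for this toy
for _ in range(500):
    p = friction * p - dt * grad_loss(w, xs, ys)  # momentum update (current w)
    w = w + dt * p                                # position update (NEW p)
print(w > 0)  # the learned weight separates the two classes
```

Structurally this resembles SGD with momentum; the difference emphasized above is that the position update consumes the freshly updated momentum, preserving the coupled geometric structure.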
The Results: 3x Better Temporal Stability
As explained, I tested this framework on the Freddie Mac Single-Family Loan-Level Dataset, the only long-term credit dataset with proper temporal splits spanning economic cycles.

Logic tells us that accuracy should decrease across the three horizons (from 12 to 60 months): long-horizon predictions tend to be less accurate than short-term ones. XGBoost does not follow this pattern (its AUC moves from 0.61 up to 0.67, the signature of optimization in the wrong space). Our symplectic optimizer, despite its lower accuracy, does follow it (AUC decreases from 0.84 to 0.70). For a 36-month prediction, which would you trust: XGBoost’s 0.97 accuracy, or the 0.77 AUC of the Hamiltonian-inspired approach? At 36 months, XGBoost’s AUC is 0.63, very close to a random prediction.
What Each Component Contributes
In our ablation study, all components contribute, with momentum in symplectic space providing the largest gains. This aligns with the theoretical background: the symplectic 2-form is preserved through the coupled position-momentum updates.
Table. Ablation Study. Standard NN with Adam optimizer vs. our approach (Full Hamiltonian Model)
When to Use This Approach
Use symplectic optimization as an alternative to standard gradient descent optimizers when:
- Ranking matters more than classification accuracy
- Distribution shift is gradual and predictable (economic cycles, not black swans)
- Temporal stability is critical (financial risk, medical prognosis over time)
- Retraining is expensive (regulatory validation, approval overhead)
- You can afford 2–3x training time for production stability
- You have <10K features (works well up to ~10K dimensions)
Don’t Use When:
- Distribution shift is abrupt/unpredictable (market crashes, regime changes)
- You need interpretability for compliance (this doesn’t help with explainability)
- You’re in ultra-high dimensions (>10K features, cost becomes prohibitive)
- Real-time training constraints (2–3x slower than Adam)
What This Actually Means for Production Systems
For organizations deploying credit models or similar challenges:
Problem: You retrain quarterly. Each time, you validate on holdout data, see 97%+ accuracy, deploy, and watch AUC degrade over 12–18 months. You blame “market conditions” and retrain again.
Solution: Use symplectic optimization. Accept slightly lower peak accuracy (80% vs 98%) in exchange for 3x better temporal stability. Your model stays reliable longer. You retrain less often. Regulatory explanations are simpler: “Our model maintains ranking stability under distribution shift.”
Cost: 2–3x longer training time. For monthly or quarterly retraining, this is acceptable — you’re trading hours of compute for months of stability.
This is engineering, not magic. We’re optimizing in a space that preserves what actually matters for the business problem.
The Bigger Picture
Model degradation isn’t inevitable. It’s a consequence of optimizing in the wrong space. Standard gradient descent finds solutions that work for your current distribution. Symplectic optimization finds solutions that preserve structure — the relationships between examples that determine rankings. Our proposed approach won’t solve every problem in ML. But for the practitioner watching their production model decay — for the organization facing regulatory questions about model stability — it’s a solution that works today.
Next Steps
The code is available: [link]
The full paper: Will be available soon. Contact me if you are interested in receiving it ([email protected])
Questions or collaboration: If you’re working on ranking problems with temporal stability requirements, I’d be interested to hear about your use case.
Thank you for reading — and sharing!
Need help implementing this kind of system?
Javier Marin Applied AI Consultant | Production AI Systems + Regulatory Compliance [email protected]