When I first approached the problem of denoising Scanning Electron Microscope (SEM) images, I assumed the key would be choosing the right architecture.
I thought in terms of layers, depth, and parameters.
After some time of research and experiments, I realized the real question is different:
How do you translate an engineer’s intuition into a mathematical objective?
In other words: how do I tell the model what must be preserved (tiny defects) and what it’s allowed to remove (scan noise)?
If you’ve ever been asked to “just do denoising for SEM images” or to improve an existing model and the data + noise feel like a black box, this post is for you.
It walks through the process I went through – from a naïve baseline to a more robust system built around custom loss functions, which is now integrated into a full application.
The challenge: when noise is part of the signal
SEM (Scanning Electron Microscope) images create a very specific computer-vision challenge:
- Low SNR – the noise is not a “nice” Gaussian; it comes from the physics of the scan.
- Edge sensitivity – unlike natural images, in wafers every tiny line or texture change may indicate a critical defect in the manufacturing process.
- The trade-off:
  - Too much smoothing → you lose defects (false negatives).
  - Too little denoising → the remaining noise makes it hard for algorithms and engineers to detect real issues.
The goal was to build a model that removes noise aggressively enough to make analysis easier, but protects fine structures and critical defects as much as possible.
Step 1 – Baseline U-Net and the limits of MSE
I started with a fairly standard setup to get a benchmark:
- Architecture: vanilla U-Net.
- Dataset: Tin-balls – ~50 pairs of noisy + clean images, a small and relatively clean dataset.
- Loss function: MSE + SSIM:
$$\mathcal{L}_{\text{baseline}} = \mathrm{MSE}(x, y) + \lambda \cdot \bigl(1 - \mathrm{SSIM}(x, y)\bigr)$$
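As a concrete reference, here is a minimal sketch of such a baseline objective in PyTorch, assuming tensors of shape (N, C, H, W) with values in [0, 1] and using the pytorch_msssim package as one possible SSIM implementation; the function name and the weight `lam` are illustrative, not the exact project code:

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # one possible off-the-shelf SSIM implementation


def baseline_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """MSE + lambda * (1 - SSIM); assumes (N, C, H, W) tensors in [0, 1]."""
    mse_term = F.mse_loss(pred, target)
    ssim_term = 1.0 - ssim(pred, target, data_range=1.0)
    return mse_term + lam * ssim_term
```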
The metrics looked “okay”, but visually something was off:
the model tended to smear sharp edges.
MSE (Mean Squared Error) heavily punishes large errors (outliers), which encourages the model to “average out” local extremes.
Instead of reconstructing the true texture, it prefers a slightly blurred version that is safer from the loss perspective.
This baseline was still very important: it provided a stable reference point to measure every later improvement against.
Step 2 – Rethinking the problem: moving to Charbonnier loss
Once the baseline worked “fine on paper” but clearly blurred edges and defects, I stopped trying to make the model bigger and asked a different question:
Where exactly is the model wrong?
Looking at examples, difference maps and metrics, I noticed a recurring pattern:
- The model reacted too aggressively to single noisy pixels
- It was willing to oversmooth entire regions just to reduce a few large local errors
So the problem was not only the model – it was how we defined “error” in the loss.
At this point I went looking for a loss function that would better match SEM noise:
- Less sensitive to outliers
- Better at preserving textures and edges
- Still smooth and differentiable for training deep networks
After comparing several options, Charbonnier loss stood out:
- It behaves similarly to $L_1$ (more robust than MSE),
- but is smooth and differentiable everywhere:
$$\mathcal{L}_{\text{char}}(x, y) = \sqrt{(x - y)^2 + \epsilon^2}$$
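In code, Charbonnier is essentially a one-liner; a sketch in PyTorch (the value of `eps` here is illustrative, not the constant used in the project):

```python
import torch


def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: sqrt((x - y)^2 + eps^2), a smooth approximation of L1."""
    diff = pred - target
    return torch.mean(torch.sqrt(diff * diff + eps * eps))
```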
I re-trained the same U-Net, on the same Tin-balls dataset, with the same training setup –
the only change was replacing the MSE term with Charbonnier.
After this change:
- PSNR and SSIM improved consistently
- Visually, edges looked more natural, with much less over-smoothing around defects
The Charbonnier-trained model consistently improved on the baseline across all the metrics I tracked, not just PSNR/SSIM.
This was the first mindset shift in the project:
Instead of “make the model bigger”, start with designing the right objective for the model to optimize.
Step 3 – Cracking wafers: adding structure and edge awareness
When I moved to the more complex Wafers dataset — roughly 1,000 wafer SEM image pairs with high-intensity, physically-inspired synthetic noise added on top of clean references — the requirements changed again.
Here, periodic patterns and tiny defects matter even more.
To help the model “understand” that, I upgraded the loss with two additional components:
MS-SSIM (Multi-Scale SSIM)
Instead of measuring similarity at a single resolution, MS-SSIM looks at multiple scales.
This helps the model preserve both:
- global structure (macro), and
- fine details (micro).
Edge-aware loss
I added a term based on image gradients (using a Sobel operator) that penalizes the model when it breaks or smears edges that exist in the input image.
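A minimal sketch of one way such a gradient term can be written in PyTorch; the kernel handling and the choice of L1 distance between gradient maps are my assumptions for illustration, not the exact project code:

```python
import torch
import torch.nn.functional as F


def sobel_gradients(img: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradient magnitude for an (N, C, H, W) tensor."""
    kx = torch.tensor([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]], device=img.device, dtype=img.dtype)
    ky = kx.t().contiguous()
    c = img.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)
    gy = F.conv2d(img, ky, padding=1, groups=c)
    return torch.sqrt(gx * gx + gy * gy + 1e-6)


def edge_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance between Sobel gradient maps, penalizing smeared or broken edges."""
    return F.l1_loss(sobel_gradients(pred), sobel_gradients(target))
```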
The full loss became:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{char}} + \alpha \cdot \mathcal{L}_{\text{MS-SSIM}} + \beta \cdot \mathcal{L}_{\text{edge}}$$
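Putting it together, reusing the `charbonnier_loss` and `edge_loss` sketches above and the `ms_ssim` function from pytorch_msssim; the weights `alpha` and `beta` below are placeholders, not the tuned project values:

```python
import torch
from pytorch_msssim import ms_ssim


def total_loss(pred: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.2, beta: float = 0.1) -> torch.Tensor:
    """Charbonnier + alpha * (1 - MS-SSIM) + beta * edge term.

    Assumes (N, C, H, W) tensors in [0, 1] and patches large enough for
    the default number of MS-SSIM scales; alpha and beta are illustrative.
    """
    char = charbonnier_loss(pred, target)
    ms_ssim_term = 1.0 - ms_ssim(pred, target, data_range=1.0)
    edge = edge_loss(pred, target)
    return char + alpha * ms_ssim_term + beta * edge
```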
The result was a clear jump in quality:
- Defects stayed sharp and visible on a clean background
- The model outperformed classical denoising methods such as BM3D or bilateral filters
- Both metrics and visual inspection aligned much better with what domain experts expected to see
What didn’t work (and what I learned from it)
A big part of this project was testing ideas that didn’t make it into production.
They were still valuable – each one refined my understanding of the problem.
Residual prediction
Idea: predict the noise (noisy − clean) instead of the clean image itself.
Motivation: the model might find it easier to focus purely on the noise component and avoid touching edges and defects. This is common in many denoising architectures.
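Schematically, the residual setup looks like this (`model`, `noisy` and `clean` are placeholders, and the loss reuses the Charbonnier sketch from earlier):

```python
# Residual formulation: the network predicts the noise component,
# and the clean estimate is recovered by subtracting it from the input.
noise_target = noisy - clean                      # training target
predicted_noise = model(noisy)                    # network output
loss = charbonnier_loss(predicted_noise, noise_target)

denoised = noisy - predicted_noise                # reconstruction at inference
```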
In practice, with SEM noise:
- Training became less stable
- I saw visual artifacts, especially around edges and defect regions
Conclusion: residual learning is not automatically a win – especially when the noise structure is complex and ground truth is limited.
Deeper U-Net
Idea: increase the model capacity (more depth) to better capture complex patterns and spatially varying noise, especially on wafers.
Expectation: better performance on the harder datasets.
Reality:
- Training and inference time increased significantly
- No clear, consistent improvement in metrics or visual quality
Lesson learned: more capacity is not always more value.
With a well-designed loss, the original architecture was already “good enough”, and adding depth didn’t justify the extra cost.
Edge map as an extra input channel
Idea: explicitly provide an edge map (e.g., Sobel) as an additional input channel, hoping the model would pay more attention to boundaries and defects from the first layer.
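Roughly, the setup I tried looked like this (a sketch reusing the `sobel_gradients` helper above; it assumes grayscale inputs and a U-Net whose first convolution accepts two input channels):

```python
# Sketch: concatenate a Sobel edge map as a second input channel.
edge_map = sobel_gradients(noisy)                  # (N, 1, H, W) for grayscale input
net_input = torch.cat([noisy, edge_map], dim=1)    # (N, 2, H, W)
denoised = model(net_input)                        # model built with in_channels=2
```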
Once I switched to a loss that already included an edge-aware term, this turned out to be unnecessary:
- The model learned to emphasize edges directly from the raw image
- There was no real gain in metrics or visual quality
- The input pipeline became more complex for no benefit
This reinforced the idea that it’s often better to encode our priorities in the loss, not just in the inputs.
Benchmarking against classical methods
To validate the approach, I compared the U-Net model against established classical denoisers on the same test set.
The deep learning model clearly outperforms the classical methods on the quality metrics.
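A comparison like this can be scripted with off-the-shelf implementations; a sketch using scikit-image is shown below (parameter values are illustrative only, and BM3D can be slotted in the same way via the separate bm3d package):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from skimage.restoration import denoise_bilateral, denoise_nl_means


def evaluate(clean: np.ndarray, denoised: np.ndarray) -> dict:
    """PSNR/SSIM for a single grayscale image pair with values in [0, 1]."""
    return {
        "psnr": peak_signal_noise_ratio(clean, denoised, data_range=1.0),
        "ssim": structural_similarity(clean, denoised, data_range=1.0),
    }


def classical_baselines(noisy: np.ndarray) -> dict:
    """A couple of classical denoisers; parameter values are illustrative."""
    return {
        "bilateral": denoise_bilateral(noisy, sigma_color=0.05, sigma_spatial=3),
        "nl_means": denoise_nl_means(noisy, h=0.02, patch_size=5, patch_distance=6),
    }
```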
However, the story is more nuanced when you factor in speed and deployment:
- Classical denoisers are simple to deploy, run on CPU, and require no training data or GPUs, which makes them attractive for quick experiments or low-resource environments.
- At the same time, they are typically tuned to relatively simple noise models and often struggle with the complex, structured noise in SEM images, forcing a compromise between over-smoothing and under-denoising.
- Our U-Net offers the best quality, but requires:
  - Initial training time and labeled noisy/clean pairs
  - A GPU for real-time processing (or runs ~2–3× slower on CPU)
The trade-off:
For high-throughput production lines where quality is critical and infrastructure exists → deep learning wins.
For prototyping or lower-volume scenarios → classical methods remain practical.
Engineering it for production
A good model is not enough if it only lives in a notebook.
From the start, the project was built with production in mind:
- Modularity – clear separation between data loaders, model definitions, metrics, and experiment code.
- Config-driven experiments – every run is defined by a config file, which makes experiments reproducible and makes it easy to tweak hyper-parameters and loss components (a minimal sketch follows this list).
- Evaluation pipeline – an automated comparison framework that
  - computes multiple quality metrics,
  - saves visual examples, and
  - compares against classical denoisers.
- Integration – the trained model is wrapped in an API and plugged into an application where users can select a model and see denoising results in real time.
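As an illustration of the config-driven setup, a run might be described by a YAML file and loaded like this (the file name and keys are hypothetical, not the project's actual schema):

```python
import yaml

# Hypothetical example: every run is fully described by a config file,
# so hyper-parameters and loss weights are reproducible and easy to tweak.
with open("configs/example_run.yaml") as f:     # illustrative path
    cfg = yaml.safe_load(f)

alpha = cfg["loss"]["ms_ssim_weight"]           # hypothetical keys
beta = cfg["loss"]["edge_weight"]
learning_rate = cfg["optim"]["lr"]
```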
Takeaways
This project taught me that in deep learning, problem definition is just as important as model design.
Shifting the focus from “which architecture should I use?” to
“what exactly do I want the model to optimize?” – via careful loss engineering –
is what allowed us to reach strong performance while preserving the tiny, critical defects that other methods tended to erase.
In the end, a powerful model is not just about using the latest tools.
It’s about combining them with a deep understanding of the data and the domain – and turning that understanding into the right objective for the model to learn.
