Introduction
Video inpainting looks deceptively similar to image inpainting.
After all, a video is just a sequence of images — right?
In practice, this assumption is responsible for most visual artifacts seen in automated video restoration systems today.
The most common symptom is flicker: unstable textures, jittering edges, and inconsistent motion in repaired regions.
This post explains why per-frame approaches fail, why optical flow only partially helps, and how modern spatiotemporal models address the problem.
⸻
The Per-Frame Trap
In a per-frame pipeline, each frame is processed independently using an image inpainting model.
This optimizes spatial quality within each frame:
• Sharp edges
• Plausible textures
• Local realism
What it does not optimize is temporal coherence.
Small stochastic differences between frames accumulate into visible instability during playback.
Each frame is “correct” — but the sequence is not.
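A minimal sketch makes the trap concrete. The snippet below (OpenCV and NumPy assumed; per_frame_inpaint and temporal_instability are illustrative helpers, not from any particular library) runs a single-image inpainter on each frame independently and then measures how much the repaired region changes between consecutive frames. In a static scene that number should be near zero; with a per-frame pipeline it rarely is.

```python
# Sketch of a per-frame inpainting pipeline. The classical cv2.inpaint call
# stands in for any single-image model; a learned model with stochastic
# sampling would only make the frame-to-frame variation stronger.
import cv2
import numpy as np

def per_frame_inpaint(frames, masks):
    """Inpaint each frame independently; no information is shared across time.
    frames: list of HxWx3 uint8 images, masks: list of HxW uint8 masks (255 = hole)."""
    repaired = []
    for frame, mask in zip(frames, masks):
        out = cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)
        repaired.append(out)
    return repaired

def temporal_instability(repaired, masks):
    """Mean absolute change inside the hole between consecutive frames.
    Large values in a static scene indicate flicker."""
    diffs = []
    for t in range(1, len(repaired)):
        hole = masks[t] > 0
        prev = repaired[t - 1].astype(np.float32)
        curr = repaired[t].astype(np.float32)
        diffs.append(np.abs(curr - prev)[hole].mean())
    return float(np.mean(diffs))
```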
⸻
Optical Flow: Useful but Fragile
To improve consistency, traditional systems introduced optical flow.
The idea is simple:
1. Estimate pixel motion between frames
2. Use motion vectors to propagate known content into missing regions
This works well under limited conditions:
• Static backgrounds
• Slow camera motion
• Minimal occlusion
However, optical flow breaks down when:
• Foreground objects occlude background regions
• Motion is non-linear or chaotic
• Lighting changes rapidly
Once flow estimation fails, artifacts propagate instead of being corrected.
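To see why flow helps at all, here is a hedged sketch of the propagation step (OpenCV and NumPy assumed; propagate_with_flow is an illustrative name, not the method of any specific system). It estimates dense flow from the current frame back to the previously restored frame, warps the previous result into the current frame, and copies the warped pixels into the hole. Note that the flow is estimated from a frame that still contains the corrupted region, which is exactly where the fragility comes from.

```python
# Flow-based propagation: fill the hole in the current frame with content
# warped from the previously restored frame.
import cv2
import numpy as np

def propagate_with_flow(prev_restored, curr_frame, curr_mask):
    prev_gray = cv2.cvtColor(prev_restored, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    # Dense flow from the current frame to the previous one, so every current
    # pixel knows where to sample in the previous restored frame.
    flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped_prev = cv2.remap(prev_restored, map_x, map_y, cv2.INTER_LINEAR)
    # Copy warped pixels only into the missing region; known pixels stay untouched.
    out = curr_frame.copy()
    hole = curr_mask > 0
    out[hole] = warped_prev[hole]
    return out
```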
⸻
Spatiotemporal Deep Learning
Modern approaches abandon frame independence entirely.
Instead of processing images sequentially, spatiotemporal models process volumes of video.
Key techniques include:
• 3D convolutional networks for joint space-time feature extraction
• Attention mechanisms that reference multiple frames simultaneously
• Transformer-based architectures that model long-range temporal dependencies
These models learn which visual information remains consistent across time — and which does not.
This fundamentally changes how missing regions are reconstructed.
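As a rough illustration, and not any specific published architecture, the PyTorch sketch below stacks 3D convolutions over a masked clip so that every output pixel in the hole is computed from a space-time neighbourhood rather than from a single frame. SpatioTemporalBlock, its channel sizes, and the mask-concatenation input are assumptions made for the example.

```python
# A toy spatiotemporal block: 3D convolutions treat a clip as one
# (channels, time, height, width) volume, so reconstructed content in the
# hole is constrained by neighbouring frames as well as neighbouring pixels.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, in_ch=4, feat=32):
        super().__init__()
        self.net = nn.Sequential(
            # kernel_size=3 with padding=1 mixes information jointly across
            # time and space while preserving the clip's shape
            nn.Conv3d(in_ch, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat, 3, kernel_size=3, padding=1),
        )

    def forward(self, frames, masks):
        # frames: (B, 3, T, H, W), masks: (B, 1, T, H, W) with 1 = missing
        x = torch.cat([frames * (1 - masks), masks], dim=1)  # 4 input channels
        return self.net(x)

# Usage: clip = torch.randn(1, 3, 8, 64, 64); holes = torch.zeros(1, 1, 8, 64, 64)
# out = SpatioTemporalBlock()(clip, holes)  # -> (1, 3, 8, 64, 64)
```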
⸻
Measuring Temporal Consistency
Temporal quality cannot be evaluated using single-frame metrics.
Common approaches include:
• Feature similarity across consecutive frames (e.g., VGG-based metrics)
• Temporal Flicker Index (TFI)
• Optical-flow residual stability scores
These metrics correlate more closely with human perception of video quality than single-frame scores do.
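Exact definitions vary between papers, so treat the following as a sketch of the first idea only: comparing deep features of consecutive frames (torchvision assumed; vgg_temporal_similarity is an illustrative helper). A sequence that flickers in the repaired region yields noticeably lower frame-to-frame feature similarity than a temporally stable one.

```python
# Frame-to-frame feature similarity as a rough temporal-consistency score.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

def vgg_temporal_similarity(frames):
    """frames: (T, 3, H, W) float tensor with values in [0, 1].
    Returns the mean cosine similarity of VGG features between consecutive
    frames; higher means the sequence is temporally steadier."""
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    extractor = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()  # up to relu3_3
    with torch.no_grad():
        feats = extractor((frames - mean) / std)   # (T, C, h, w)
    feats = feats.flatten(start_dim=1)             # (T, C*h*w)
    sims = F.cosine_similarity(feats[:-1], feats[1:], dim=1)
    return sims.mean().item()
```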
⸻
Practical Implications
Temporal modeling is not an academic detail.
It directly determines whether a system is usable in:
• Video restoration
• Object removal
• Watermark erasure
• Generative video editing
Any pipeline that ignores temporal consistency will fail under real-world conditions.
⸻
Conclusion
The biggest mistake in video AI is treating time as an afterthought.
Per-frame methods optimize images. Spatiotemporal methods optimize videos.
Understanding this distinction explains why many tools fail — and why newer architectures are finally closing the gap between automated video processing and professional results.
This post is adapted from a longer technical article exploring the full evolution from optical flow to spatiotemporal AI: