I recently joined the "Vibe Coding" movement—letting AI write 90% of my code to ship fast. It felt like magic.
I "prompted" my way to building a Free Video Watermark Remover in a single weekend. No boilerplate, just vibes. I deployed it, shared it on Dev.to, and traffic started pouring in.
But then, the "Vibe Check" failed hard.
The AI wrote functional code, but it didn’t write efficient code. Reality hit me in the form of a server log.
The Disaster: Paying for Failure 💸
I woke up to a sea of red logs: RuntimeError: CUDA out of memory.
Users were uploading raw 4K, 60fps videos recorded on modern iPhones. These files were massive. When fed directly into the AI model (ProPainter on Replicate), the VRAM spiked, and the process crashed after about 4 minutes.
Here is the kicker with Serverless GPU providers like Replicate: You pay for the compute time, even if the task fails.
I was paying $0.20 - $0.30 for every crashed session. I was literally burning money to frustrate my users. The "Vibe" was gone.
The Naive Solution vs. The Engineering Solution
My first instinct (and the AI’s suggestion) was to "Vertical Scale": Just upgrade the hardware. Switch from an NVIDIA T4 to an A100 (40GB VRAM).
- The problem: A100s are expensive. This would triple my costs per second. It doesn’t solve the inefficiency; it just masks it with money.
Then I realized: Does a TikTok reposter really need 4K 60fps? Most social media platforms compress videos to 1080p or 720p anyway. Processing 60 frames per second for a watermark removal is a waste of compute—30fps is visually identical for this use case.
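A quick back-of-the-envelope check (a Python sketch using the resolutions above) shows just how much work the normalization saves the GPU:

```python
# Pixels per second the inpainting model must process,
# before and after normalization.
raw_4k_60fps = 3840 * 2160 * 60      # raw 4K 60fps iPhone upload
norm_720p_30fps = 1280 * 720 * 30    # after 720p @ 30fps normalization

ratio = raw_4k_60fps / norm_720p_30fps
print(f"Normalization cuts pixel throughput by {ratio:.0f}x")  # prints "18x"
```

An 18x reduction in pixel throughput is the difference between a T4 crashing and a T4 finishing comfortably.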
The Fix: The CPU Middleware Pattern 🛡️
Vibe coding gets you from 0 to 1, but System Architecture gets you from 1 to 100.
Instead of sending raw videos directly to the GPU, I introduced a CPU Pre-processing Layer. I already had a VPS on Hostinger (which I paid for annually). The CPU cycles there were effectively free (sunk cost).
The New Architecture:
1. User Uploads -> My Hostinger Server (CPU).
2. Normalization -> FFmpeg compresses video to 720p @ 30fps.
3. Inference -> Optimized video sent to Replicate (GPU).
4. Merge -> AI output video + Original Audio merged back via FFmpeg.
The Code Implementation
Here is the Python logic I used to normalize the inputs before they ever touch the expensive GPU.
import subprocess

def optimize_video_for_ai(input_path, output_path):
    """
    Standardizes video to 720p and 30fps to prevent OOM.
    Uses 'veryfast' preset to minimize CPU latency.
    """
    command = [
        "ffmpeg",
        "-i", input_path,
        "-vf", "scale='min(1280,iw)':-2",  # Downscale to 720p max, keep aspect ratio
        "-r", "30",                        # Force 30fps
        "-c:v", "libx264",
        "-preset", "veryfast",             # Prioritize speed
        "-crf", "28",                      # Slight compression to speed up upload
        "-an",                             # Remove audio (we merge original audio later)
        "-y",
        output_path
    ]
    subprocess.run(command, check=True)
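For completeness, the final merge step (step 4 of the pipeline) can be sketched the same way. Note that `build_merge_command` and `merge_original_audio` are hypothetical helper names; the post only describes this step in prose.

```python
import subprocess

def build_merge_command(ai_video_path, original_path, output_path):
    """
    Builds the FFmpeg command that copies the AI-processed video stream
    and re-attaches the audio track from the original upload.
    (Hypothetical helper; sketch of the merge step described above.)
    """
    return [
        "ffmpeg",
        "-i", ai_video_path,   # video source: AI output (audio was stripped earlier)
        "-i", original_path,   # audio source: the user's original upload
        "-map", "0:v:0",       # take the video stream from the first input
        "-map", "1:a:0?",      # take the audio stream from the second input, if present
        "-c:v", "copy",        # no re-encode: the GPU output is already final
        "-c:a", "aac",         # re-encode audio to a web-safe codec
        "-shortest",           # stop when the shorter stream ends
        "-y",
        output_path,
    ]

def merge_original_audio(ai_video_path, original_path, output_path):
    subprocess.run(build_merge_command(ai_video_path, original_path, output_path), check=True)
```

Copying the video stream (`-c:v copy`) keeps this step nearly instant on the CPU, since only the audio is re-encoded.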