Full tutorial: https://www.youtube.com/watch?v=yOj9PYq3XYM
NVFP4 models have finally arrived in ComfyUI, and therefore SwarmUI, with CUDA 13. NVFP4 models are more than 100% faster (over twice as fast) with minimal impact on quality. I have done grid quality comparisons to show you the difference between the NVFP4 versions of FLUX 2, Z Image Turbo, and FLUX 1. To make CUDA 13 work, I have compiled Flash Attention, Sage Attention, and xFormers for both Windows and Linux with all CUDA archs, supporting virtually all GPUs from the GTX 1650 series through the RTX 2000, 3000, 4000, and 5000 series and beyond.
In this full tutorial, I will show you how to upgrade your ComfyUI, and therefore SwarmUI, to the latest CUDA 13 with the latest libraries and Torch 2.9.1. Moreover, our compiled libraries such as Sage Attention work with all models on all GPUs without producing black images or videos, including Qwen Image and Wan 2.2 models. LTX 2 presets and a tutorial are hopefully coming soon too. Finally, I introduce a new private cloud GPU platform called SimplePod, similar to RunPod. It offers the same features as RunPod but is much faster and cheaper.
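If you want to sanity-check the upgrade yourself, here is a minimal Python sketch (assumption: the compiled Sage Attention wheel installs as the "sageattention" package; adjust to your build):

# Minimal environment check for the CUDA 13 / Torch 2.9.1 upgrade.
# The "sageattention" package name is an assumption; adjust to your install.
import torch
print("Torch:", torch.__version__)   # expect 2.9.1
print("CUDA:", torch.version.cuda)   # expect a 13.x build
print("GPU available:", torch.cuda.is_available())
try:
    import sageattention  # compiled attention library from the installer
    print("Sage Attention: importable")
except ImportError:
    print("Sage Attention: not installed")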
📂 Resources & Links:
ComfyUI Installers: [ https://www.patreon.com/posts/ComfyUI-Installers-105023709 ]
SimplePod: [ https://simplepod.ai/ref?user=secourses ]
SwarmUI Installer, Model Auto Downloader and Presets: [ https://www.patreon.com/posts/SwarmUI-Install-Download-Models-Presets-114517862 ]
How to Use SwarmUI Presets & Workflows in ComfyUI + Custom Model Paths Setup for ComfyUI & SwarmUI Tutorial: [ https://youtu.be/EqFilBM3i7s ]
SECourses Discord Channel for 24/7 Support: [ https://discord.com/invite/software-engineering-courses-secourses-772774097734074388 ]
NVIDIA NVFP4 Blog Post: [ https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/ ]
⏱️ Video Chapters:
- 00:00:00 New ComfyUI installer (CUDA 13, Torch 2.9.1, Triton + attention libs)
- 00:00:19 NVFP4 speedup claims vs real tests; why CUDA 13 enables new models
- 00:00:34 Prebuilt FlashAttention/SageAttention/xFormers for many GPUs (Windows + Linux)
- 00:01:00 Quality roadmap: FLUX2 Dev, Z Image Turbo, FLUX Dev (BF16/FP8/GGUF/NVFP4)
- 00:01:23 Downloader adds NVFP4: FLUX2 Dev, FLUX Dev (Context/Dev), Z Image Turbo
- 00:01:51 SimplePod AI intro: RunPod-style pods, cheaper rates, permanent storage
- 00:02:36 Musubi Tuner FP8 Scaled: quality myths vs GGUF + why scaled matters
- 00:03:10 Quantization & precision (FP32/BF16/FP8/GGUF) + Qwen3 low-VRAM encoders
- 00:03:34 ComfyUI v73 zip: CUDA 13 included; update NVIDIA drivers only (v72 deprecated)
- 00:04:13 Update steps: overwrite zip, delete venv, run install/update .bat
- 00:05:02 Python: 3.10 recommended (supports 3.10-3.13); fresh vs update
- 00:06:02 New installer flow: uv speed, standalone use, backend libs detected
- 00:07:12 Stability flags: --cache-none vs --disable-smart-memory (OOM/stuck fixes; see the launch sketch after the chapter list)
- 00:07:54 SwarmUI presets: 32 presets supported; drag/drop + auto model downloader
- 00:08:25 Update SwarmUI model-downloader zip (extract + overwrite)
- 00:08:49 Download bundles/models (Z Image Turbo Core + NVFP4 options)
- 00:09:25 Update/launch SwarmUI; point to updated ComfyUI backend + set args
- 00:10:32 Live gen test: Z Image Turbo BF16 @1536x1536
- 00:11:29 Switch to NVFP4: VRAM cache behavior; 1024x1024
- 00:12:36 FLUX2 Dev quality: FP8 Scaled vs NVFP4 side-by-side comparisons
- 00:13:33 Speed chart: FLUX2 NVFP4 about 193% faster than FP8 Scaled
- 00:14:10 Z Image Turbo quality: BF16 vs NVFP4 vs FP8 Scaled (quant method)
- 00:15:25 FLUX Dev: FP8 Scaled approx GGUF Q8; NVFP4 currently shows degradation
- 00:16:45 What precision means + model size examples (FP32/BF16/FP8 Scaled/NVFP4)
- 00:18:07 Practical recommendations: BF16 best; avoid FP16; raw FP8 vs FP8 Scaled
- 00:19:43 GGUF explained: block quant, slower runtime; use only when RAM is too low
- 00:21:36 Precision hierarchy recap + when to pick FP8 mixed/scaled over GGUF
- 00:21:58 SimplePod setup: register, add credits, open template link
- 00:22:31 Template config + RunPod price comparison (disk, ports, GPU selection)
- 00:24:02 Persistent volume: create + mount to /workspace
- 00:25:11 Launch RTX Pro 6000 pod; SimplePod vs RunPod pricing differences
- 00:26:29 Temp vs persistent disk: deleting the instance wipes temp data; back up first!
- 00:26:55 JupyterLab: upload zips, apt install zip, unzip ComfyUI in workspace
- 00:27:48 Run install script; unzip SwarmUI; start the model downloader
- 00:29:02 Downloader path for ComfyUI + folder structure; download Z Image Turbo bundle
- 00:30:08 Start ComfyUI; confirm CUDA 13 + Torch 2.9.1; connect via port 3000 Direct
- 00:31:08 Preset demo: Z Image Turbo Quality 1; fix VAE path; monitor VRAM
- 00:33:18 File Browser Direct: download outputs/models fast; upload files back
- 00:34:41 Restart server; install/start SwarmUI; open Cloudflared URL
- 00:36:26 SwarmUI backend: /workspace/ComfyUI/main.py + args; import presets
- 00:37:27 Download FLUX2 Core + NVFP4; share model paths between SwarmUI & ComfyUI
- 00:39:27 FLUX2 NVFP4 generation @2048x2048; VRAM usage + step speed
- 00:40:43 Cloud GPU pitfall: diagnosing a power-capped GPU
- 00:41:28 Resume: re-run template w/ volume; reconnect fast
- 00:45:02 Wrap-up: SimplePod pros (direct/secure, cheaper storage)
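For the stability flags covered at 00:07:12, here is a hedged Python launcher sketch; both flag names come from ComfyUI's CLI, but confirm them against python main.py --help on your build:

# Launcher sketch for the stability flags discussed at 00:07:12.
# --cache-none: re-load models each run (lowest RAM use, slower).
# --disable-smart-memory: aggressively move idle models from VRAM to RAM.
import subprocess
import sys
subprocess.run([sys.executable, "main.py", "--disable-smart-memory"], check=True)

Try one flag at a time: --cache-none trades speed for the lowest memory footprint, while --disable-smart-memory mainly helps with out-of-memory and stuck-generation issues.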