LLM Model Storage with NFS: Download Once, Infer Everywhere
digitalocean.com

Your vLLM pods are probably downloading the same massive model file every time they start.

If you’ve deployed LLM inference on Kubernetes, you may have taken the straightforward path: point vLLM at HuggingFace and let it download the model when the pod starts. It works. But here’s what happens next:

  • A pod crashes at 2 AM. The replacement pod spends several minutes downloading gigabytes of model weights from HuggingFace before it can serve a single request.
  • You need to scale up during a traffic spike. Each new pod downloads the model independently, competing for bandwidth…
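
The alternative the title points to is to download the weights once onto a shared NFS volume and mount that volume read-only into every inference pod. A minimal sketch of what that can look like on Kubernetes follows; the PVC name, storage class, image tag, and model path are illustrative assumptions, not the article's exact manifests:

```yaml
# Sketch: a shared ReadWriteMany PVC backed by NFS (names and the "nfs"
# StorageClass are assumptions for illustration).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: ["ReadWriteMany"]   # NFS lets many pods mount the same volume
  storageClassName: nfs            # assumed NFS-backed StorageClass
  resources:
    requests:
      storage: 100Gi
---
# vLLM Deployment that serves weights already present on the shared volume,
# so restarts and scale-ups skip the HuggingFace download entirely.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 3
  selector:
    matchLabels: {app: vllm}
  template:
    metadata:
      labels: {app: vllm}
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          # Point --model at a local path on the mount instead of a Hub ID;
          # the exact model directory here is a placeholder.
          args: ["--model", "/models/meta-llama/Llama-3.1-8B-Instruct"]
          volumeMounts:
            - name: model-cache
              mountPath: /models
              readOnly: true
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
```

A one-off Job (or a single bootstrap pod) can populate /models with `huggingface-cli download` before the Deployment rolls out; after that, new replicas start serving once vLLM has loaded the weights from the mount rather than after a multi-gigabyte download.
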
