Deploy data-heavy Python services with warm memory, sticky routing, and zero infrastructure code.
What memrun is
Memrun is a Python SDK and deployment platform for services that are memory-bound, not CPU-bound: services where you load a 2GB dataset or a large ML model, then serve many requests against it, and where the expensive part is getting data into memory, not processing it.
You write a handler function with a decorator. Memrun handles everything else: provisioning workers on Hetzner VMs, routing related requests to the same worker via Kafka partition keys, managing a 600GB NVMe-backed cache per worker, bounding concurrency with semaphores, and reporting health via heartbeats.
```python
from memrun import MemoryService

svc = MemoryService(
    name="doc-search",
    memory="32Gi",
    disk="600Gi",
    max_workers=10,
    concurrency=16,
    timeout_seconds=300,
)

@svc.handler(sticky_key="corpus_id")
async def handle(ctx, req):
    embeddings = await ctx.get_or_fetch(req["embeddings_path"])
    results = search(embeddings, req["query"], top_k=req.get("top_k", 10))
    return {"matches": results}
```
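The `sticky_key` above is what keeps related requests on the same worker. The underlying idea of Kafka partition-key routing can be sketched with any stable hash (illustrative only: real Kafka clients use murmur2, and the partition count and worker assignment here are assumptions, not memrun internals):

```python
import hashlib

def partition_for(sticky_key: str, num_partitions: int) -> int:
    """Deterministically map a sticky key to a partition.

    Sketch: Kafka's default partitioner uses murmur2, not MD5, but any
    stable hash gives the same "same key -> same partition" property.
    """
    digest = hashlib.md5(sticky_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Every request for the same corpus hashes to the same partition,
# and therefore reaches the worker consuming that partition.
assert partition_for("corpus-42", 10) == partition_for("corpus-42", 10)
```

Because the mapping is a pure function of the key, no coordination service is needed to keep routing consistent; any producer computes the same partition independently.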
Deploy with one command:
```shell
memrun deploy handler.py --name doc-search --memory 32Gi --disk 600Gi
```
That’s it. No Dockerfiles, no Kubernetes manifests, no Terraform. The platform boots VMs, installs your handler, and starts serving.
The problem it solves
Every time I built a data-intensive service on Lambda or Cloud Run, the same pattern emerged:
1. Request arrives
2. Fetch 1-3GB from S3 (200-800ms)
3. Deserialize into working structures (100-500ms)
4. Do the actual computation (50-200ms)
5. Return result
6. Container gets killed
7. Next request: repeat from step 1
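In code, the cold path above looks something like this (hypothetical helper names; the point is that steps 2 and 3 run inside the handler, on every call):

```python
import json

def handle_request_cold(req, fetch_from_s3):
    """Anti-pattern: pay the fetch + deserialize cost on every request.

    `fetch_from_s3` is a stand-in for a real S3 client call returning bytes/str.
    """
    raw = fetch_from_s3(req["embeddings_path"])             # step 2: 200-800ms
    embeddings = json.loads(raw)                            # step 3: 100-500ms
    return {"matches": embeddings[: req.get("top_k", 10)]}  # step 4: 50-200ms
```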
Steps 2-3 dominate runtime. Steps 6-7 make it worse. The actual work is step 4, but you’re paying for steps 2-3 on every single request.
With memrun, steps 2 and 3 happen once. The LRUCache stores fetched data on NVMe. The SharedWorkerContext keeps decoded structures in memory across requests. Kafka’s sticky routing ensures the same user/dataset combination always lands on the same worker. After the first request, subsequent requests skip straight to step 4.
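The warm-path idea can be reduced to a minimal sketch (not memrun's actual LRUCache, which spills to NVMe and tracks byte sizes): memoize the decoded structure keyed by path, so only the first request pays the fetch cost.

```python
from collections import OrderedDict

class WarmCache:
    """Tiny in-memory LRU keyed by path. Sketch only: the real cache
    also persists entries to NVMe and evicts by byte size, not count."""

    def __init__(self, max_entries: int = 8):
        self._entries: OrderedDict = OrderedDict()
        self._max = max_entries

    def get_or_fetch(self, path: str, fetch):
        if path in self._entries:
            self._entries.move_to_end(path)    # mark as most recently used
            return self._entries[path]
        value = fetch(path)                    # cold path: steps 2-3
        self._entries[path] = value
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

Combined with sticky routing, the cache hit rate stays high because requests for the same dataset keep arriving at the worker that already holds it.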
How it works, concretely
The SDK
You define a MemoryService with resource declarations:
```python
svc = MemoryService(
    name="analytics",             # DNS-compatible service name
    memory="32Gi",                # RAM per worker
    disk="600Gi",                 # NVMe cache per worker
    max_workers=50,               # Maximum worker count
    min_workers=2,                # Minimum (always-on) workers
    concurrency=16,               # Max concurrent requests per worker
    timeout_seconds=300,          # Per-request timeout
    env={"MODEL_VERSION": "v3"},  # Environment variables
)
```
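The `"32Gi"` / `"600Gi"` quantities follow the Kubernetes-style binary-suffix convention. A parser for them (an illustrative helper, not part of the memrun SDK) might look like:

```python
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(q: str) -> int:
    """Parse a quantity like '32Gi' into bytes.

    Sketch: supports only binary suffixes and bare byte counts.
    """
    for suffix, factor in _BINARY.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # bare number of bytes

assert parse_quantity("32Gi") == 32 * 2**30
```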
Init handlers
Load expensive resources once per worker lifetime: