Deploy data-heavy Python services with warm memory, sticky routing, and zero infrastructure code.
What memrun is
Memrun is a Python SDK and deployment platform for services that are memory-bound, not CPU-bound: services where you load a 2GB dataset or a large ML model, then serve many requests against it, and where the expensive part is getting data into memory, not processing it.
You write a handler function with a decorator. Memrun handles everything else: provisioning workers on Hetzner VMs, routing related requests to the same worker via Kafka partition keys, managing a 600GB NVMe-backed cache per worker, bounding concurrency with semaphores, and reporting health via heartbeats.
```python
from memrun import MemoryService

svc = MemoryService(
    name="doc-search",
    memory="32Gi",
    disk="600Gi",
    max_workers=10,
    concurrency=16,
    timeout_seconds=300,
)

@svc.handler(sticky_key="corpus_id")
async def handle(ctx, req):
    embeddings = await ctx.get_or_fetch(req["embeddings_path"])
    results = search(embeddings, req["query"], top_k=req.get("top_k", 10))
    return {"matches": results}
```
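The `sticky_key` above is what keeps related requests on the same worker. The underlying idea of Kafka partition-key routing can be sketched with any stable hash (illustrative only: real Kafka clients use murmur2, and the partition count and worker assignment here are assumptions, not memrun internals):

```python
import hashlib

def partition_for(sticky_key: str, num_partitions: int) -> int:
    """Deterministically map a sticky key to a partition.

    Sketch: Kafka's default partitioner uses murmur2, not MD5, but any
    stable hash gives the same "same key -> same partition" property.
    """
    digest = hashlib.md5(sticky_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Every request for the same corpus hashes to the same partition,
# and therefore reaches the worker consuming that partition.
assert partition_for("corpus-42", 10) == partition_for("corpus-42", 10)
```

Because the mapping is a pure function of the key, no coordination service is needed to keep routing consistent; any producer computes the same partition independently.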
Deploy with one command:
```shell
memrun deploy handler.py --name doc-search --memory 32Gi --disk 600Gi
```
That’s it. No Dockerfiles, no Kubernetes manifests, no Terraform. The platform boots VMs, installs your handler, and starts serving.
The problem it solves
Every time I built a data-intensive service on Lambda or Cloud Run, the same pattern emerged:
1. Request arrives
2. Fetch 1-3GB from S3 (200-800ms)
3. Deserialize into working structures (100-500ms)
4. Do the actual computation (50-200ms)
5. Return result
6. Container gets killed
7. Next request: repeat from step 1
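In code, the cold path above looks something like this (hypothetical helper names; the point is that steps 2 and 3 run inside the handler, on every call):

```python
import json

def handle_request_cold(req, fetch_from_s3):
    """Anti-pattern: pay the fetch + deserialize cost on every request.

    `fetch_from_s3` is a stand-in for a real S3 client call returning bytes/str.
    """
    raw = fetch_from_s3(req["embeddings_path"])             # step 2: 200-800ms
    embeddings = json.loads(raw)                            # step 3: 100-500ms
    return {"matches": embeddings[: req.get("top_k", 10)]}  # step 4: 50-200ms
```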
Steps 2-3 dominate runtime. Steps 6-7 make it worse. The actual work is step 4, but you’re paying for steps 2-3 on every single request.
With memrun, steps 2 and 3 happen once. The LRUCache stores fetched data on NVMe. The SharedWorkerContext keeps decoded structures in memory across requests. Kafka’s sticky routing ensures the same user/dataset combination always lands on the same worker. After the first request, subsequent requests skip straight to step 4.
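The warm-path idea can be reduced to a minimal sketch (not memrun's actual LRUCache, which spills to NVMe and tracks byte sizes): memoize the decoded structure keyed by path, so only the first request pays the fetch cost.

```python
from collections import OrderedDict

class WarmCache:
    """Tiny in-memory LRU keyed by path. Sketch only: the real cache
    also persists entries to NVMe and evicts by byte size, not count."""

    def __init__(self, max_entries: int = 8):
        self._entries: OrderedDict = OrderedDict()
        self._max = max_entries

    def get_or_fetch(self, path: str, fetch):
        if path in self._entries:
            self._entries.move_to_end(path)    # mark as most recently used
            return self._entries[path]
        value = fetch(path)                    # cold path: steps 2-3
        self._entries[path] = value
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

Combined with sticky routing, the cache hit rate stays high because requests for the same dataset keep arriving at the worker that already holds it.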
How it works, concretely
The SDK
You define a MemoryService with resource declarations:
```python
svc = MemoryService(
    name="analytics",             # DNS-compatible service name
    memory="32Gi",                # RAM per worker
    disk="600Gi",                 # NVMe cache per worker
    max_workers=50,               # Maximum worker count
    min_workers=2,                # Minimum (always-on) workers
    concurrency=16,               # Max concurrent requests per worker
    timeout_seconds=300,          # Per-request timeout
    env={"MODEL_VERSION": "v3"},  # Environment variables
)
```
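The `"32Gi"` / `"600Gi"` quantities follow the Kubernetes-style binary-suffix convention. A parser for them (an illustrative helper, not part of the memrun SDK) might look like:

```python
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(q: str) -> int:
    """Parse a quantity like '32Gi' into bytes.

    Sketch: supports only binary suffixes and bare byte counts.
    """
    for suffix, factor in _BINARY.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # bare number of bytes

assert parse_quantity("32Gi") == 32 * 2**30
```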
Init handlers
Load expensive resources once per worker lifetime: