mesh image router
I wrote mesh image router five years ago to be the backend of a photo sharing app. The CDN mesh can be deployed on any high-storage-capacity, high-network-throughput servers as a self-routing, ad-hoc, content-addressed data-fetching layer. I refreshed it a bit this year with more docs and features.
architecture
mesh image router (mir) uses a storage- and location-aware self-routing hierarchy. You could even place a "real" CDN in front of mir, but mir is designed to route and manage petabytes of storage on the backend itself (though it expects an ultimate fallback origin server for all content, so the image router's tiny cdn does expire unused (or less-requested) content over time).
The mesh image router operates hierarchically where you can provision:
- more expensive fast edge nodes for the "hottest objects" (big ram + nvme + 400+ Gbps networking)
- but if content isn't found at the "hot and fast" layer, it falls back to the next mid layer (larger multi-terabyte SSD servers, 200+ Gbps networking)
- but if content isn't found at the next mid layer, it falls back to the bulk-before-last-resort layer (multi-petabyte HDD storage, 100+ Gbps networking)
- but if content isn't found in the bulk-before-last-resort layer, it gets fetched from cold storage (B2/S3 etc).
When images are found at a lower level, they get fetched into the higher level as replicated (and LRU-managed) content for faster future serving. Also, since this is an image router, it is aware of image sizes and can provision/scale/deduplicate images based on size classes as well: it detects the user's intent for the quality of an image to return, then figures out the best "available right now" copy to serve the request even if that exact rendition doesn't exist yet. e.g. a web request arrives for a 320x240 thumbnail; you don't have the original image, but you do have an already-downscaled 1024x768 copy, so you re-downscale the 1024x768 to the new thumbnail size instead of pulling the original (which could be 15+ MB) and resizing it all the way down to a thumbnail.
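a minimal sketch of that size-class pick, assuming renditions are tracked as (width, height) tuples (the names here are illustrative, not mir's actual internals):

```python
# illustrative sketch only: pick the smallest cached rendition that can still
# satisfy the request, so we never upscale and never touch the (possibly 15+ MB)
# original unless nothing smaller works.
def best_resize_source(requested, cached_sizes):
    """requested: (w, h); cached_sizes: iterable of (w, h) renditions on disk."""
    adequate = [s for s in cached_sizes
                if s[0] >= requested[0] and s[1] >= requested[1]]
    return min(adequate, default=None)  # smallest adequate source, or None

# the example from the text: want 320x240, original missing, 1024x768 cached
assert best_resize_source((320, 240), [(1024, 768), (2048, 1536)]) == (1024, 768)
```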
basically, all serving follows this path (sketched in code after the list):
- Content Routing: Content IDs are hashed to determine the owning peer
- Local Serving: If content exists locally on the owner, serve directly
- Peer Streaming: If content is on a peer, stream via control protocol (without copying data)
- Upstream Fetch: If not cached, fetch from upstream level
- Origin Fetch: If not in any cache, fetch from S3 origin
- On-demand Resize: Generate smaller sizes from larger cached versions
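here's a condensed, hypothetical stand-in for that chain; it only demonstrates the order of the fallbacks (all names and size classes are made up for the example, and `hash()` stands in for the real MD5 ring):

```python
import os

PEERS = ["edge-1", "edge-2", "edge-3"]
ME = "edge-1"
SIZES = ["320x240", "1024x768", "orig"]  # smallest to largest
CACHE = "/var/cache/mir-demo"

def serve(content_id: str, size: str) -> str:
    if PEERS[hash(content_id) % len(PEERS)] != ME:       # 1. content routing
        return f"stream {content_id} from owning peer"    # 3. peer streaming
    path = os.path.join(CACHE, f"{content_id}.{size}")
    if os.path.isfile(path):                              # 2. local serving
        return f"serve {path} + bump LRU"
    for larger in SIZES[SIZES.index(size) + 1:]:          # 6. on-demand resize
        src = os.path.join(CACHE, f"{content_id}.{larger}")
        if os.path.isfile(src):
            return f"resize {src} down to {size}"
    return f"fetch {content_id} from upstream, then S3"   # 4/5. upstream/origin
```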
One interesting thing I found while building the system: pypy lets it serve over 1 million requests per second, while regular cpython can't push this architecture past about 300k requests per second. So I make sure to keep everything "pypy compliant" and not include any extra unnecessary features or modules pypy doesn't support.
protocol
mir implements a custom binary protocol where every frame is tagged with an activity:
Private Control Protocol
Binary TCP protocol with 5-byte length prefix:
┌─────────────────────────────────────────────────────┐
│ Frame Structure                                     │
├─────────────────────────────────────────────────────┤
│ [5 bytes: length] [1 byte: command] [N bytes: data] │
└─────────────────────────────────────────────────────┘
Commands:
| Byte | Command | Payload |
|---|---|---|
| # | Heartbeat | HEARTBEAT literal |
| ^ | Ping | PING literal |
| * | Pong | PONG literal (response) |
| + | Request | File path (UTF-8) |
| = | Save | 3-byte JSON len + JSON + binary |
| @ | Record | JSON hit record |
| - | Warm | File path for pre-caching |
| & | Rebalance | Same format as Save |
Save/Rebalance Payload Structure:
┌─────────────────────────────────────────────────────────┐
│ [3 bytes: JSON length] [JSON metadata] [binary content] │
└─────────────────────────────────────────────────────────┘
JSON metadata: {"relname": "relative/path/to/file.ext"}
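a sketch of the framing in python, with one loud assumption: I'm treating both length prefixes as zero-padded ASCII digits, and the 5-byte length as covering command + data. the tables above only say "5 bytes" / "3 bytes", so raw big-endian integers are just as plausible; check the source before relying on this.

```python
import json

def encode_save(relname: str, content: bytes) -> bytes:
    """Build a Save (=) frame: length prefix + command + 3-byte JSON len + JSON + binary."""
    meta = json.dumps({"relname": relname}).encode()
    body = b"=" + b"%03d" % len(meta) + meta + content  # command + payload
    return b"%05d" % len(body) + body                   # 5-byte length prefix

def decode_frame(buf: bytes):
    """Assumes buf holds at least one complete frame; returns (cmd, data, rest)."""
    length = int(buf[:5])                # ASCII-digit length (assumption, see above)
    cmd, data = buf[5:6], buf[6:5 + length]
    return cmd, data, buf[5 + length:]
```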
Content Request Flow
1. Client Request
GET /puddle/{webId}/{size}/{filename.ext}
        │
        ▼
2. Path Resolution
   ┌───────────────────────────────────┐
   │ Check webPathCache                │
   │ ├─ If cached, use cached path     │
   │ ├─ If False, return 404           │
   │ └─ If None, resolve via API       │
   └───────────────────────────────────┘
        │
        ▼
3. Content ID Mapping
   ┌───────────────────────────────────┐
   │ webID → contentId via API         │
   │ contentId → filesystem path       │
   └───────────────────────────────────┘
        │
        ▼
4. Peer Routing
   ┌───────────────────────────────────┐
   │ Hash contentId to peer            │
   │ ├─ If self: continue              │
   │ └─ If other: stream from peer     │
   └───────────────────────────────────┘
        │
        ▼
5. Local Cache Check
   ┌───────────────────────────────────┐
   │ File exists on disk?              │
   │ ├─ Yes: serve + LRU increment     │
   │ └─ No: try generation/fetch       │
   └───────────────────────────────────┘
        │
        ▼
6. Image Generation
   ┌───────────────────────────────────┐
   │ Higher-res version exists?        │
   │ ├─ Yes: request resize            │
   │ └─ No: fetch from upstream        │
   └───────────────────────────────────┘
        │
        ▼
7. Upstream Fetch
   ┌───────────────────────────────────┐
   │ Level 1+ cache → Control fetch    │
   │ Origin level → S3 GET + resize    │
   └───────────────────────────────────┘
        │
        ▼
8. Response
FileResponse with max-cache headers
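the tri-state webPathCache in step 2 is worth spelling out; a toy version (illustrative names, with negative results cached so known-bad paths 404 without an API round trip):

```python
# key absent -> unknown, resolve via the content-resolution API
# False      -> known-bad, short-circuit to a 404
# str        -> resolved filesystem path, use it
webPathCache = {}

async def resolve_web_path(web_id: str):
    cached = webPathCache.get(web_id)
    if cached is None:
        cached = await lookup_content_api(web_id) or False  # cache negatives too
        webPathCache[web_id] = cached
    return cached  # False means "return 404"

async def lookup_content_api(web_id: str):
    return None  # stand-in for the real webID -> contentId API call
```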
Upload Flow
1. POST /upload-propagate
multipart/form-data: meta (JSON) + afile (binary)
        │
        ▼
2. Authentication
   ┌───────────────────────────────────┐
   │ Validate upload key from meta     │
   │ └─ Invalid: 401 + 3s delay        │
   └───────────────────────────────────┘
        │
        ▼
3. Local Storage
   ┌───────────────────────────────────┐
   │ Stream to temp file               │
   │ Atomic rename to target path      │
   │ Skip if file already exists       │
   └───────────────────────────────────┘
        │
        ▼
4. Upstream Propagation
   ┌───────────────────────────────────┐
   │ Push to upstream ControlNode      │
   │ via = protocol command            │
   └───────────────────────────────────┘
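step 3's temp-file-plus-rename trick in miniature (a sketch, not mir's code; the rename stays atomic only while the temp file lives on the same filesystem as the target):

```python
import os
import tempfile

def store_upload(target_path: str, chunks) -> bool:
    """Stream chunks to a temp file, then atomically rename into place.
    Returns False when the file already exists (upload skipped)."""
    if os.path.exists(target_path):
        return False                      # skip if file already exists
    d = os.path.dirname(target_path)
    os.makedirs(d, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=d)     # same fs, so rename is atomic
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        os.rename(tmp, target_path)       # atomic on POSIX
    except BaseException:
        os.unlink(tmp)                    # clean up the partial temp file
        raise
    return True
```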
Multi-Process Architecture
┌──────────────────────────────────────────────────────────────┐
│                         Main Process                         │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                       LRUManager                       │  │
│  │                (multiprocessing.Manager)               │  │
│  └────────────────────────────────────────────────────────┘  │
│          │                  │                  │             │
│          ▼                  ▼                  ▼             │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐     │
│  │   Worker 0    │  │   Worker 1    │  │   Worker N    │     │
│  │  MeshConfig   │  │  MeshConfig   │  │  MeshConfig   │     │
│  │  Public HTTP  │  │  Public HTTP  │  │  Public HTTP  │     │
│  │  Private TCP  │  │  Private TCP  │  │  Private TCP  │     │
│  └───────────────┘  └───────────────┘  └───────────────┘     │
└──────────────────────────────────────────────────────────────┘
Startup Sequence
- Parse command line for server name
- Find server position in mesh config
- Create LRUManager (shared across workers)
- Initialize shared LRU with disk capacity
- Fork worker processes (one per CPU)
- Each worker:
- Creates MeshConfig instance
- Starts public HTTP server
- Starts private TCP server
- Connects to peer control nodes
- Runs async event loop
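the shape of that startup in miniature, using stdlib multiprocessing (the worker body here is a stand-in; the real one builds a MeshConfig and runs the HTTP/TCP servers in an asyncio loop):

```python
import multiprocessing as mp
import os

def worker(idx: int, shared_lru) -> None:
    shared_lru[f"worker-{idx}"] = "ready"   # every worker sees the same dict

if __name__ == "__main__":
    manager = mp.Manager()                  # stands in for LRUManager
    shared_lru = manager.dict()             # shared LRU state across workers
    procs = [mp.Process(target=worker, args=(i, shared_lru))
             for i in range(os.cpu_count() or 1)]  # one worker per CPU
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(dict(shared_lru))
```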
Unix Socket Naming
Public servers on Unix sockets use process-indexed paths:
{base_path}.{process_idx}.sock
Example: /tmp/lru-00.0.sock, /tmp/lru-00.1.sock, …
mesh topology definitions
since we have a service mesh, we have to figure out how to connect everything together.
it's just one big-ass config file:
TOML Topology Format
# config/topology.prod.toml
[meta]
name = "production"
description = "Production CDN mesh for example.com"
version = "1.0"
[s3]
host = "my-bucket.s3.us-west-000.backblazeb2.com"
[services]
content_resolution_api = "http://api.internal:9999/papi/pt"
thumbnail_endpoint = "http://thumbs.internal:6661/job.image.thumbnails.Thumbnails"
# =============================================================================
# Level 0: Edge Cache Nodes (closest to users)
# =============================================================================
# These nodes receive client requests and form a peer mesh.
# Content is distributed across peers using consistent hashing.
[[levels]]
[[levels.servers]]
name = "edge-us-east-1"
cache_dir = "/var/cache/mir/edge-us-east-1"
# Where clients are redirected for this server's content
[levels.servers.redirect_target]
host = "edge1.cdn.example.com"
port = 443
# Unix socket for local nginx proxy (high performance)
[[levels.servers.public]]
type = "unix"
path = "/run/mir/edge1.sock"
workers = "cpus" # One worker per CPU core
# Direct HTTP access (for health checks, debugging)
[[levels.servers.public]]
type = "net"
host = "0.0.0.0"
port = 8080
# Private control channel for peer communication
[[levels.servers.private]]
host = "edge1.internal.example.com"
port = 5556
[[levels.servers]]
name = "edge-us-west-1"
cache_dir = "/var/cache/mir/edge-us-west-1"
[levels.servers.redirect_target]
host = "edge2.cdn.example.com"
port = 443
[[levels.servers.public]]
type = "unix"
path = "/run/mir/edge2.sock"
workers = "cpus"
[[levels.servers.public]]
type = "net"
host = "0.0.0.0"
port = 8081
[[levels.servers.private]]
host = "edge2.internal.example.com"
port = 5557
[[levels.servers]]
name = "edge-eu-west-1"
cache_dir = "/var/cache/mir/edge-eu-west-1"
[levels.servers.redirect_target]
host = "edge3.cdn.example.com"
port = 443
[[levels.servers.public]]
type = "net"
host = "0.0.0.0"
port = 8082
[[levels.servers.private]]
host = "edge3.internal.example.com"
port = 5558
# =============================================================================
# Level 1: Regional Cache
# =============================================================================
# Aggregates cache misses from edge nodes in a region.
# No public interface - only accessed by edge nodes.
[[levels]]
[[levels.servers]]
name = "regional-us"
cache_dir = "/var/cache/mir/regional-us"
[[levels.servers.private]]
host = "regional-us.internal.example.com"
port = 6666
[[levels.servers]]
name = "regional-eu"
cache_dir = "/var/cache/mir/regional-eu"
[[levels.servers.private]]
host = "regional-eu.internal.example.com"
port = 6667
# =============================================================================
# Level 2: Origin (S3-compatible storage)
# =============================================================================
# The ultimate source of truth. Uses the S3 host defined in [s3] section.
[[levels]]
[[levels.servers]]
name = "origin"
is_s3_origin = true
Section Reference
[meta] - Topology Metadata
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique name for this topology |
| description | string | No | Human-readable description |
| version | string | No | Version identifier (default: "1.0") |
[s3] - Origin Storage
| Field | Type | Required | Description |
|---|---|---|---|
| host | string | Yes | S3-compatible endpoint hostname |
Note: S3 credentials are stored in secrets, not here.
[services] - External Services
| Field | Type | Required | Description |
|---|---|---|---|
| content_resolution_api | string | No | API for web ID → content ID resolution |
| thumbnail_endpoint | string | No | Service for on-demand image resizing |
[[levels]] - Cache Hierarchy Tiers
Levels are defined as an array. Level 0 is closest to users, higher levels are closer to origin.
[[levels.servers]] - Server Definitions
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Unique server identifier |
| cache_dir | string | No | Local filesystem path for cache storage |
| is_s3_origin | boolean | No | True if this is an S3 origin server |
[levels.servers.redirect_target] - Client Redirect
| Field | Type | Required | Description |
|---|---|---|---|
| host | string | Yes | Public hostname for redirects |
| port | integer | Yes | Port (typically 443 for HTTPS) |
[[levels.servers.public]] - Public Interfaces
Unix socket interface:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "unix" |
| path | string | Yes | Socket file path |
| workers | string/int | No | "cpus" or specific count (default: "cpus") |
Network interface:
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "net" |
| host | string | Yes | Bind address (e.g., "0.0.0.0") |
| port | integer | Yes | Listen port (1-65535) |
[[levels.servers.private]] - Peer Control Channel
| Field | Type | Required | Description |
|---|---|---|---|
| host | string | Yes | Internal hostname/IP |
| port | integer | Yes | Control port (1-65535) |
Topology Dataclasses
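the repo's actual dataclasses aren't reproduced here, but a plausible reconstruction from the field tables above (trimmed to the private-channel fields; tomllib is stdlib on cpython 3.11+, pypy wants the tomli backport):

```python
from dataclasses import dataclass, field
import tomllib  # stdlib in CPython 3.11+; use the tomli backport elsewhere

@dataclass
class PrivateEndpoint:
    host: str
    port: int

@dataclass
class Server:
    name: str
    cache_dir: str = ""
    is_s3_origin: bool = False
    private: list = field(default_factory=list)

def load_topology(path: str):
    """Parse topology.toml into [[Server, ...], ...] ordered edge -> origin."""
    with open(path, "rb") as f:
        raw = tomllib.load(f)
    return [[Server(name=s["name"],
                    cache_dir=s.get("cache_dir", ""),
                    is_s3_origin=s.get("is_s3_origin", False),
                    private=[PrivateEndpoint(**p) for p in s.get("private", [])])
             for s in level.get("servers", [])]
            for level in raw["levels"]]
```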
hotspot avoidance
we implement a "Request Flow with Replication" allowing content to be replicated, then detected and served from "not its primary shard" as well.
Client requests contentId="viral123" at Non-Owner Peer 2
        │
        ▼
┌───────────────────────────────────────────────────────┐
│ 1. Path Resolution (webPathCache lookup)              │
│    PublicWebPath → ContentId + extension              │
│    Track: path_cache_hit or path_cache_miss           │
└───────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────┐
│ 2. Local Cache Check (CRITICAL - checks replicas!)    │
│    if os.path.isfile(fsPath):                         │
│      → File exists (owner OR replica)                 │
│      Track: lru_hit                                   │
│      record_access(contentId) → hot tracking          │
│      return FileResponse                              │
└───────────────────────────────────────────────────────┘
        │ (if not cached locally)
        ▼
┌───────────────────────────────────────────────────────┐
│ 3. Determine Owner via Consistent Hash                │
│    owner = hash_at_peer_level(contentId)              │
│          = MD5(contentId) % num_peers                 │
└───────────────────────────────────────────────────────┘
        │
        ├─ If owner == self:
        │        │
        │        ▼
        │  ┌──────────────────────────────────────────────┐
        │  │ Owner Path: Try generate → fetch upstream    │
        │  │ Track: lru_miss (if fetching)                │
        │  │ record_access(contentId)                     │
        │  └──────────────────────────────────────────────┘
        │
        └─ If owner != self:
                 │
                 ▼
           ┌──────────────────────────────────────────────┐
           │ 4. Stream from Owner Peer                    │
           │    Track: cross_peer_stream                  │
           │    record_access(contentId)                  │
           │      → may trigger replication               │
           │    return StreamResponse                     │
           └──────────────────────────────────────────────┘
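the owner rule and its WARM targets in code; how the MD5 digest gets reduced to an integer is my assumption (full digest as a big-endian int), the repo may do it differently:

```python
import hashlib

def hash_at_peer_level(content_id: str, num_peers: int) -> int:
    """The owner rule from box 3: MD5(contentId) % num_peers."""
    digest = hashlib.md5(content_id.encode()).digest()
    return int.from_bytes(digest, "big") % num_peers  # digest-to-int is my guess

def ring_neighbors(owner: int, num_peers: int, count: int = 2):
    """The peers right after the owner on the ring: the WARM targets below."""
    return [(owner + i) % num_peers for i in range(1, count + 1)]
```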
Replication Lifecycle
Time: T0 (First Request)
├─ Non-owner Peer 2 receives request for "viral123"
├─ Not in local cache
├─ Streams from Owner (Peer 1)
└─ record_access("viral123")
   └─ webPathCache["_ac_viral123"] = 1
Time: T+10s (Second Request)
├─ Another request at Peer 2
├─ Still streams from owner
└─ webPathCache["_ac_viral123"] = 2
Time: T+20s (Third Request)
├─ Another request at Peer 2
├─ webPathCache["_ac_viral123"] = 3
└─ Promoted to HotItemTracker (thread-safe)
   ├─ access_count = 3
   ├─ last_access = T+20
   └─ decay_score = 3.0
Time: T+60s (Background Sync Task Runs)
├─ Calculate adaptive threshold
│  ├─ If 60K req/min: threshold ≈ 297
│  └─ If 100 req/min: threshold ≈ 5
│
├─ Update decay scores for all tracked items
│  └─ "viral123": score = 3 × e^(-0.693×40/300) = 2.7
│
├─ Find items with score > threshold
│  ├─ If threshold=5: "viral123" score too low
│  └─ Need more accesses or lower traffic
│
└─ For hot items: send WARM to neighbors
Time: T+5min (Score=100, viral content!)
├─ decay_score > threshold
├─ Calculate targets: ring neighbors of owner
│  └─ Owner=Peer 1, neighbors=[2,3]
│
├─ Send WARM commands
│  ├─ Peer 1 → Peer 2: WARM /vi/ra/l1/md/viral123.jpg
│  └─ Peer 1 → Peer 3: WARM /vi/ra/l1/md/viral123.jpg
│
├─ Peer 2 receives WARM
│  ├─ Fetches from Peer 1
│  ├─ Caches locally
│  └─ Sends ACK
│
└─ Mark as replicated in bloom filter
   └─ Future requests at Peer 2: served from local cache!
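the decay math above implies a 300-second half-life (0.693 is ln 2); a tiny sketch that reproduces the walkthrough's numbers (the adaptive threshold formula isn't reproduced here, only the decay):

```python
import math

HALF_LIFE_S = 300  # inferred from the 0.693 (= ln 2) and /300 in the math above

def decayed_score(access_count: int, idle_seconds: float) -> float:
    """Hotness halves every HALF_LIFE_S seconds without an access."""
    return access_count * math.exp(-math.log(2) * idle_seconds / HALF_LIFE_S)

# the walkthrough's numbers: 3 accesses, 40s idle -> ~2.7
assert round(decayed_score(3, 40), 1) == 2.7
```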
all tests pass
stats
claude prompt
the current claude code lets you format your own in-cli prompt with details, somewhat like a normal shell prompt.
i combined all the details i could fetch and added these live-updating details to the claude bottom toolbar area:
the prompt shows you a live view of:
- current model choice (a bit of a lie; see the rant below about claude code's shitty "shared global config for concurrent sessions" management)
- unique input tokens this session (not counting duplicate sends or cache prefixes)
- output tokens (generated by the model) this session
- context remaining, shown as used tokens vs available tokens
  - available tokens is automatically reduced by the 77,000-token buffer claude code reserves as its "compaction trigger"
- lines of code added and removed this session
- total time this session has been active
- estimated API cost if you were using a cash-based plan (the subscription plan still shows cost, but obviously you aren't dynamically charged per-usage)
Just add this to your ~/.claude/settings.json:
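```json
{
  "statusLine": {
    "type": "command",
    "command": "~/.claude/statusline.py"
  }
}
```

(that's the documented statusLine hook shape; merge the key into whatever you already have there, and chmod +x the script.)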
where statusline.py is this thing (not worth a repo really):
#!/usr/bin/env python3
"""
Enhanced Claude Code status line.
Layout: prompt (left-aligned) | metadata (right-aligned)
"""
import json
import os
import re
import sys
# Constants
AUTOCOMPACT_BUFFER = 77_000 # Reserved for compaction, adjust if needed
TERM_WIDTH = 140 # Assumed terminal width (can't detect live width in subprocess)
# ANSI colors
C_RESET = "\033[0m"
C_RED = "\033[31m"
C_GREEN = "\033[32m"
C_YELLOW = "\033[33m"
C_BLUE = "\033[34m"
C_MAGENTA = "\033[35m"
C_CYAN = "\033[36m"
C_DIM = "\033[2m"
def strip_ansi(text):
"""Remove ANSI escape codes to get visible length."""
return re.sub(r"\033\[[0-9;]*m", "", text)
def visible_len(text):
"""Get visible length of text (excluding ANSI codes)."""
return len(strip_ansi(text))
def fmt_k(n):
"""Format number as Xk or X.Xk"""
if n >= 1000:
return f"{n / 1000:.0f}k" if n >= 10000 else f"{n / 1000:.1f}k"
return str(n)
def fmt_duration(ms):
"""Format milliseconds as human-readable duration."""
secs = ms / 1000
if secs < 60:
return f"{secs:.0f}s"
mins = int(secs // 60)
secs = int(secs % 60)
if mins < 60:
return f"{mins}m{secs:02d}s"
hours = int(mins // 60)
mins = int(mins % 60)
return f"{hours}h{mins:02d}m"
def main():
try:
data = json.load(sys.stdin)
except json.JSONDecodeError:
print("[?] error")
return
# Debug: save last input for checking formatting offline or in testing
if False:
try:
with open(
os.path.expanduser("~/.claude/statusline-lastinput.json"), "w"
) as f:
json.dump(data, f, indent=2)
except Exception:
pass
# === Extract data ===
# Model
model = data.get("model", {}).get("display_name", "?")
# Cumulative session tokens
ctx = data.get("context_window", {})
total_in = ctx.get("total_input_tokens", 0)
total_out = ctx.get("total_output_tokens", 0)
ctx_size = ctx.get("context_window_size", 200_000)
# Usable context (excluding autocompact buffer)
usable_ctx = ctx_size - AUTOCOMPACT_BUFFER
# Current context from transcript
current_ctx = 0
transcript_path = data.get("transcript_path")
if transcript_path and os.path.exists(transcript_path):
try:
# Read only the last ~32KB of transcript (much faster for long sessions)
with open(transcript_path, "rb") as f:
f.seek(0, 2) # End of file
size = f.tell()
read_size = min(size, 32768)
f.seek(max(0, size - read_size))
tail = f.read().decode("utf-8", errors="ignore")
# Parse lines from the end
for line in reversed(tail.splitlines()):
try:
msg = json.loads(line)
usage = msg.get("message", {}).get("usage")
if usage:
current_ctx = (
usage.get("input_tokens", 0)
+ usage.get("cache_creation_input_tokens", 0)
+ usage.get("cache_read_input_tokens", 0)
)
break
except (json.JSONDecodeError, KeyError, TypeError):
continue
except Exception:
pass
# Fallback to cumulative
if current_ctx == 0:
current_ctx = total_in + total_out
approx = "~"
else:
approx = ""
# Context percentage of usable space
pct = int((current_ctx / usable_ctx) * 100) if usable_ctx > 0 else 0
# Color for context percentage
if pct < 50:
pct_color = C_GREEN
elif pct < 80:
pct_color = C_YELLOW
else:
pct_color = C_RED
# Cost and stats
cost_data = data.get("cost", {})
cost = cost_data.get("total_cost_usd", 0)
lines_added = cost_data.get("total_lines_added", 0)
lines_removed = cost_data.get("total_lines_removed", 0)
api_duration_ms = cost_data.get("total_api_duration_ms", 0)
# Prompt: user@host:path
user = os.environ.get("USER", "?")
host = os.uname().nodename.split(".")[0]
cwd = os.getcwd()
home = os.path.expanduser("~")
if cwd.startswith(home):
cwd = "~" + cwd[len(home) :]
# === Build output ===
# Left side: prompt
left = f"{C_RED}{user}{C_RESET}@{C_MAGENTA}{host}{C_RESET}:{C_BLUE}{cwd}{C_RESET}"
# Right side: metadata
# Format lines changed: +added-removed
lines_str = f"{C_GREEN}+{lines_added:,}{C_RESET};{C_RED}-{lines_removed:,}{C_RESET}"
right_parts = [
f"{C_CYAN}{model}{C_RESET}",
f"in:{C_YELLOW}{fmt_k(total_in)}{C_RESET} out:{C_YELLOW}{fmt_k(total_out)}{C_RESET}",
f"ctx:{pct_color}{approx}{pct}%{C_RESET} {C_DIM}({fmt_k(current_ctx)}/{fmt_k(usable_ctx)}){C_RESET}",
lines_str,
f"{C_DIM}{fmt_duration(api_duration_ms)}{C_RESET} {C_MAGENTA}${cost:,.2f}{C_RESET}",
]
right = " | ".join(right_parts)
# Calculate padding to right-align metadata
left_len = visible_len(left)
right_len = visible_len(right)
padding = TERM_WIDTH - left_len - right_len
# Ensure at least 2 spaces between left and right
if padding < 2:
padding = 2
output = f"{left}{' ' * padding}{right}"
print(output, end="")
if __name__ == "__main__":
main()
note though: the prompt can't be 100% accurate because, as we covered previously, anthropic is oddly incompetent when it comes to system and utility and infrastructure design. For example, here, you can run multiple claude commands, but they all appear to share the same global ~/.claude/settings.json which gets updated live as you run things, so if you have two terminals running claude with different /model settings, the last one to change /model overwrites your global "model" setting in the config file? wtf? (this is also why you can see things like "you have used -700k tokens!" in the UI when it gets confused about your current model, or when it aggressively compacts your 1 million token model session at 150k tokens, confused because one global state backs independent concurrent processes that constantly conflict.) all over the claude code ecosystem we see them combine (and confuse) config files with metrics files, dumping live self-rewriting metrics into your static config files, which is just… utter bullshit if you've ever designed systems before? you don't add dynamically updating/changing/automated-written stats into static user config profiles? wtf are they even doing over there? must be some meta-ethics brain where you just make everything self-rewriting json so all state is always outdated and corrupt and nobody can save or restore their own config files without breaking things, because all files "self-rewrite real-time stats" inside of them. again, we say: wtf?