I’ll be the first to call out that this post is a little rant-y, but I think the points are very valid.
At the inception of every weekend project I find myself in the same conundrum: Where should I deploy it?
The fatigue of figuring out where to run my project borderline prevents me from building in the first place.
These projects would be so easy if I could write them with the same brevity of a local binary or script, but we live in a world where servers die and disks forget things.
The problem is that nothing that’s easy actually solves this. The big hyperscaler clouds do, but in my eyes, the largest obstacle to using them is… using them.
Because they need to cater to everyone, they have so many options that it starts to feel like a Rube Goldberg machine.
Of course they sell premium “we’ll manage everything” services, but they are haunted by the ghosts of IAM past, and are typically the most likely products to be murdered in the next blog post.
That leaves you with the primitives if you want reliability.
New cloud providers like Fly, Modal, Convex, Cloudflare (Workers), and Vercel are all taking their stab at an “opinionated cloud” (“OCloud”?). Whenever I try to use one of these new clouds, I find hard-stop limitations and sharp edges, usually undocumented.
I know the providers here are starting with their niche and working their hardest to build the best products they can. Unfortunately, their communication sometimes gets a bit ahead of their product. Maybe we need “we’re not a good fit if…” pages? So take my rant with a grain of salt: I know things are constantly changing, and people are working really hard to close these gaps.
I’ve found the undocumented 32,768-connection limit on Cloudflare Durable Objects, and undocumented R2 rate limits (really just suffocating the metadata layer at around 400 req/sec per bucket). I’ve found previously undocumented max queue sizes for Modal’s .spawn() (it’s 1M, documented now).
I guess when I build things, I heave data.
Private networking often doesn’t exist, or is unusably slow (≤2 Gbit/s).
You get forced to play “update your image with our new SDK or we’re taking your service offline in 30 days” whack-a-mole (Modal).
You get forced into a single language, so when you need to bring in another one for package or feature requirements, you have to break down the walls of the garden.
Runtimes come with resource and duration restrictions: a 64 MB memory limit on the V8 runtime, 512 MB on the Node.js runtime, a 10-minute maximum duration. Volumes are limited to 3,000 disk IOPS (about 0.1% of a modern NVMe, a total joke).
Your object storage product has 3x more expensive egress than S3 + CloudFront, clients can only pull files at literally 500 Kbit/s, yet it uses S3 to back it? It also has no access controls, so anyone with the UUID can download the file.
“Yes, use our database! No, we don’t have unique indexes... No, you can’t specify the primary key…” Sorry, am I the one writing the database within my code now as a bunch of KV wrappers?
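To make that concrete, here’s a hypothetical sketch of the kind of KV wrapper you end up writing to fake a unique index yourself (the kv client, key layout, and create_user function are all invented for illustration):

```python
# Hypothetical sketch: emulating a "unique email" constraint on top of a
# bare key-value store, because the managed database has no unique indexes.
def create_user(kv, user_id: str, email: str) -> None:
    email_key = f"unique:email:{email}"
    # Check-then-set is racy without transactions: two concurrent signups
    # can both pass this check. A real unique index solves exactly this.
    if kv.get(email_key) is not None:
        raise ValueError("email already taken")
    kv.put(email_key, user_id)                    # reserve the email
    kv.put(f"user:{user_id}", {"email": email})   # then write the record
```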
They dance around these limitations by shoving higher-level primitives like workflows down your throat: beautiful green rolling hills riddled with landmines. “Don’t forget about max workflow history size and length!” “Don’t you ever change the order of your functions, or we’ll throw determinism errors!”
These higher-level abstractions are often just like the clouds themselves: they look nice on the surface, but once you enter, you find a lot of problems that make you want to turn and run the other way (or risk blowing up).
These providers tend to have the shakiest reliability, with transient errors like the underlying host dying far more often than a hyperscaler.
Observability is often insufficient, or inextensible to the point where you have to bring in dedicated observability providers just to learn and confirm it is indeed your cloud’s fault that your service stopped running (don’t worry, the status page won’t be updated for another hour).
Docs will be terribly out of date, even the landing page docs. You’ll follow the quickstart guide to the letter, and immediately hit bugs that Claude has to help you unravel.
Sometimes the login page doesn’t work, and you have to contact one provider through the support channels of another provider with shared Slack channels, because to contact the offending platform’s support you have to make a ticket through the dashboard (which sits behind the login that doesn’t work).
There’s a constant “I can feel the technical debt” feeling in the product.
Out of the five or so projects I’ve deployed to fly.io, using fly.io templates from the latest flyctl, zero of them started correctly after the first fly launch.
Sometimes, machines would just go “Suspended” with no additional information.
Well, I’m not paying for support to tell me why their UI and API won’t tell me what’s wrong, so after abusing some YC channels for support, I heard back:
Hi Dan,
This would appear to be related to some temporary capacity issues in IAD over the last week. When your app scales to zero with an auto_stop_machines setting, if there’s a lack of host capacity in the region we’re sometimes not able to bring up the Machine when the Proxy auto-starts it. In this case, I’d recommend either of two things:
1. set auto_stop_machines = "off" in your fly.toml
2. scale your app to a nearby region with fly scale count
Wait… isn’t it your job to keep my apps running? Maybe move the machine? Maybe bring it up once you have capacity? It wasn’t attached to a disk, so there’s no technical reason this couldn’t have been immediately brought up on another host. I just had to restart it manually, and it came online.
And this was not their “Machines API”, this was “Apps”, the “just give us a Dockerfile and we’ll run it”. We had 2 instances running. Both died like this:
Note: This may have been solved. This happened over a year ago and I haven’t used fly for a new project since.
I think the fly.io team are infra geniuses; you can tell from their blog posts. They’ve made an incredibly enticing DX. But for a platform that advertises “Run any code fearlessly”, I have a lot of fear about my uptime. They’ve even acknowledged this historically, so I won’t call them ignorant.
Everything else is incredibly promising:
- Managed Apps that (allegedly) keep your app instances up
- Low-level Machine API
- Volumes that can move with Machines, causing very little downtime
- Internal VPCs (private networking)
- Managed certs
- Pre-defined DNS (via app/machine metadata)
All these neoclouds are focusing on the simplest DX possible, but they maybe don’t realize that vendor lock-in is not a good DX. Here’s what I think is the ultimate DX:
Take my docker compose file that I used for local development, give the services with volumes bottomless, reasonably performant storage, and keep it all available. Collect logs and metrics for me automatically, and make them searchable with SQL (or, in a really distant second, PromQL). Let me write my own alarms with that query language, and give me a few sane options like Slack, SMS, Webhook, and Email to alert me.
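Purely as a hypothetical sketch of that (nothing under x-platform exists in any real product; the block and its fields are invented here), it would look like taking the compose file I already have and adding a few lines:

```yaml
# The compose file I already use locally, plus imaginary platform knobs.
# Everything under x-platform is made up, not any real product's schema.
services:
  api:
    build: .
    ports: ["8080:8080"]
  db:
    image: postgres:16
    volumes:
      - pgdata:/var/lib/postgresql/data   # bottomless, durable, just there

volumes:
  pgdata: {}

x-platform:                 # invented extension block (compose ignores x-* keys)
  logs: collect             # logs + metrics collected and queryable with SQL
  alerts:
    - name: error-spike
      query: "SELECT count(*) FROM logs WHERE level = 'error' AND ts > now() - interval '5 minutes'"
      threshold: 100
      notify: [slack, email]
```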
BTW: If it’s easy for me to understand and deploy, it’s easy for agents to understand and deploy too!
“It’s a lot more complicated than it sounds!”
Sure, but maybe not crazy complicated.
Distributed storage is incredibly convenient, and despite what some database people on X (Twitter) say, can be sufficiently performant.
We all know deep down that the hardest part of deploying anything, and the reason our code balloons 10x between PoC and MVP, is state (specifically, making it reliably durable).
How nice would it be if you could just use SQLite or RocksDB on disk instead of having to set up (or buy hosted) Postgres? Don’t even get me started on the olympics of setting up observability for a database that doesn’t have a native metrics shipping interface, much less a native Prometheus metrics endpoint.
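For a sense of scale, the entire persistence layer of a weekend project could be this small if the platform just guaranteed the disk (a minimal sketch; the /data path and schema are made up):

```python
# Minimal sketch: SQLite on a durable volume instead of a managed Postgres.
import sqlite3

conn = sqlite3.connect("/data/app.db")   # assumes /data actually survives restarts
conn.execute("PRAGMA journal_mode=WAL")  # better concurrency for a small web app
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()
```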
Unless you really know what you’re doing (i.e. you’ve made your own distributed storage before), there are providers that will make this easy for you, like Archil and Simplyblock.
You don’t even need S3 at that point, just make a user_media directory and set up a route to serve it (ofc make sure you don’t have path traversal vulnerabilities).
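As a rough sketch of how small that route can be (Flask is just an example framework here; the /media route and user_media directory names are placeholders):

```python
# Minimal sketch: serve files from a user_media directory with a basic
# path-traversal guard, instead of reaching for S3.
from pathlib import Path
from flask import Flask, abort, send_file

app = Flask(__name__)
MEDIA_ROOT = Path("user_media").resolve()

@app.route("/media/<path:filename>")
def media(filename: str):
    requested = (MEDIA_ROOT / filename).resolve()
    # Reject anything that escapes user_media (e.g. ../../etc/passwd).
    if MEDIA_ROOT not in requested.parents:
        abort(404)
    if not requested.is_file():
        abort(404)
    return send_file(requested)
```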
Firecracker and cloud-hypervisor make it really easy to run OCI images once they’re dumped into a rootfs, with their CMD/ENTRYPOINT extracted to become part of the init.
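The OCI-to-rootfs part is roughly this dance (a sketch that leans on standard Docker CLI commands purely for extraction; the image name and target directory are placeholders):

```python
# Sketch: flatten an OCI image into a rootfs directory and pull out its
# ENTRYPOINT/CMD so a microVM init process can exec it.
import json, os, subprocess

image = "nginx:latest"                 # placeholder image
rootfs = "/var/lib/vms/demo/rootfs"    # placeholder target directory
os.makedirs(rootfs, exist_ok=True)

subprocess.run(["docker", "create", "--name", "rootfs-extract", image], check=True)
subprocess.run(f"docker export rootfs-extract | tar -x -C {rootfs}", shell=True, check=True)
subprocess.run(["docker", "rm", "rootfs-extract"], check=True)

config = json.loads(subprocess.run(
    ["docker", "inspect", "--format", "{{json .Config}}", image],
    capture_output=True, text=True, check=True,
).stdout)
# The guest init would exec entrypoint + cmd (with Env set accordingly).
print((config.get("Entrypoint") or []) + (config.get("Cmd") or []))
```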
Writing a basic (but robust) compute scheduler is not crazy hard. Definitely “give a cracked engineer no Cursor billing limit and a work week” doable.
Building container images is trivial. You don’t even need Docker to do it [1] [2] (and you can even use your compute scheduler to run the builders!)
Write some “here’s how you run production-ready {insert your favorite DB here} on our platform!” guides, and maybe even auto-detect the Prometheus exporters and create automatic dashboards and alerts for known services (Postgres, Redis, RedPanda, etc.)
Honestly the hardest part might be networking. WireGuard is too slow, and VXLAN is scary (even SOTA LLMs haven’t helped me much here in my own tinkering).
Just give me a few knobs in the compose YAML to get from 0-10k req/sec. Once we get past that, our CEOs can play email chess to see whether you can give us the features and TCO to not move off to k8s on a hyperscaler.
I can think of a few OClouds that would retweet this being like “yes this is what we’re building at XYZ”.
Then why are you in my neocloud rant above?