Introduction
Because of personal reasons, I have not been plugged into tech news for months. Imagine my surprise when one of the first stories I encounter as I briefly plug back in is about an outage in Cloudflare.
Well, people had opinions, and I have opinions on their opinions. And a solution to at least one of the problems.
In this post, Eduardo Bellani argues that the Cloudflare outage should not have happened. Why? Well, he implies that when companies become big enough, their systems are infrastructure and should be treated as such:
Yet again, another global IT outage happened (déjà vu strikes again in our industry). This time at Cloudflare. Again, taking down large swaths of the internet with it.
…
FAANG-style companies are unlikely to adopt formal methods or relational rigor wholesale. But for their most critical systems, they should. It’s the only way to make failures like this impossible by design, rather than just less likely.
The internet would thank them. (Cloud users too—caveat emptor.)
The software industry is terrible at treating things like infrastructure. Physical infrastructure, for example, requires civil engineering done by professional, certified civil engineers. Software has no equivalent certification, even though physical infrastructure can be brought down by the software contained within it.
“But,” you may protest, “Cloudflare is not public infrastructure.”
Perhaps. Nevertheless, Cloudflare is so big that it affects public infrastructure, and I would argue that’s the same thing. And Eduardo and I are not the only ones with this opinion.
But what could Cloudflare have done? Eduardo suggests three things, but I’ll only focus on the last one: “formally verified application code.” This is my focus because it is the one that got the most pushback on Hacker News, and the very one for which I have a solution.
The Impractical Solution
Formal verification, aka “formal methods,” is essentially about proving the absence of bugs in programs. This is in contrast to testing, which cannot do so:
Program testing can be used to show the presence of bugs, but never to show their absence!
If you’ve ever run into bugs, you may ask why the industry doesn’t do this.
The answer: it’s prohibitively expensive, including in opportunity cost. In fact, it’s so expensive that I do not think it should be done everywhere, even though I want real, professional software engineering, which includes using formal methods.
But there is another problem:
…we know how to do it better, but nearly all of even the smartest engineers we can hire are not smart enough.
It turns out that the math required is hard.
But that’s not even the hard part! The hard part is getting the specification right. This is why Donald Knuth said,
Beware of bugs in the above code; I have only proved it correct, not tried it.
The difficulty is why someone suggested the same standard as avionics software instead.
“Wait, software in aircraft is not formally verified?” you ask.
Nope! It still has bugs, which have killed people.
And some call it useless.
So formal verification is too expensive, and it’s too hard. It is the solution to Cloudflare’s problem, but it is the impractical solution, the solution that is impossible to actually implement effectively.
The Practical Solution
And yet, the promise and allure of formal verification call to us like a siren song. Could we still make it work?
Probably not “all-encompassing” formal methods. However, the Pareto Principle exists, so could we get 80% of the benefit from 20% of the effort?
I think so, and so do others:
The mistake most people make is they assume that formal methods is an all-or-nothing thing: either you’re not using formal methods or you have complete verification.
Perhaps we could prove one small property about a whole program. Or maybe we could prove several small properties about pieces of a program. Sure, it wouldn’t prove the absence of bugs, but it would prove the absence of those bugs, in those parts of the program, and that is almost as valuable.
The How
So! How do we do that?
As it turns out, I’ve already written about a solution, but let me rehash and expand on it here.
Local Reasoning
We need a programming system (i.e., a programming language) where every function acts as its own subprogram, and where effects in one subprogram do not affect other subprograms, especially other concurrent subprograms.
“Why, Gavin?” you may ask.
Because if subprograms cannot affect other subprograms, then each subprogram is essentially its own standalone program. This means that we, or automatic tools, can look at a piece of code and prove small properties about it. This is called local reasoning.
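To make this concrete, here is a quick contrast in Rust (my own sketch, not Yao):

// This function depends only on its arguments and touches nothing
// else, so any property we prove about it holds no matter what the
// rest of the program, or any other thread, is doing.
fn clamp_percent(x: i64) -> i64 {
    x.clamp(0, 100)
}

use std::sync::atomic::{AtomicI64, Ordering};

static COUNTER: AtomicI64 = AtomicI64::new(0);

// By contrast, this function reads and writes shared state, so it
// cannot be reasoned about in isolation; its result depends on every
// other subprogram that touches COUNTER.
fn bump() -> i64 {
    COUNTER.fetch_add(1, Ordering::SeqCst) + 1
}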
Inductive Reasoning
Local reasoning is not the only kind we need, however. The other is inductive reasoning.
This means that all small properties that we try to prove can be proven by induction.
Why is this crucial? Because it is the other half of local reasoning.
Remember: we want to look at (or run tools on) one piece of code and be able to prove things about its whole subprogram. But that subprogram may contain other subprograms, and if the properties were not inductive, we could no longer prove them from that one piece of code alone; instead, we would have to prove the properties about the whole of the subprogram all at once.
This would cause the “all-or-nothing” mistake mentioned earlier.
However, if the properties are inductive, then when we look at (or run tools on) a piece of code, we merely need to check whether all of its subprograms have the same property. If they do, and the rest of the code has the property, then the complete subprogram does too.
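Sketched in Rust, with a hypothetical no_panic property in mind, the composition looks like this:

// Suppose a tool has already proven halve and carry panic-free. To
// prove average panic-free, it only needs to check this one body:
// it calls only proven functions and has no panic sites of its own
// (no unwrap, no indexing, no possible overflow).
fn halve(x: u64) -> u64 {
    x / 2 // dividing by a nonzero constant cannot panic
}

fn carry(a: u64, b: u64) -> u64 {
    (a % 2 + b % 2) / 2 // operands are bounded; cannot panic or overflow
}

fn average(a: u64, b: u64) -> u64 {
    // halve(a) + halve(b) is at most u64::MAX - 1, and carry adds at
    // most 1, so this sum cannot overflow.
    halve(a) + halve(b) + carry(a, b)
}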
Small Properties
And then, once we have local and inductive reasoning, we can prove things about one, two, or more of those subprograms. This is one piece of the puzzle, where we can prove small properties about pieces of a program.
“But, Gavin,” you may demand, “you’ve handwaved away what you mean by ‘small properties,’ and you haven’t solved the problem of getting specifications right!”
You would be right. Until now. For the definition of “small properties” is the solution to the difficulty of specification.
A small property is the existence or absence of any one single inductive aspect of a program that could cause one or more bugs.
This definition is probably confusing, so let me explain in plainer terms.
Think of bugs that you fix. Could you classify those bugs into different types of bugs? I think you can. For instance, here are a few classifications:
- Use of a null pointer/object/whatever.
- Memory leaks.
- Data races.
- Race conditions.
- Memory bugs (in C and C++).
- Etc.
Could you think of properties that might describe the existence or absence of those types of bugs? The first one is easy: if you have nullable types, then null bugs can exist, and if you have only non-nullable types, then null bugs cannot exist. Rust prevents data races, but not race conditions.
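For instance, here is the null property at work in Rust (a sketch of mine):

// Rust has no null; absence must be spelled Option, and the compiler
// refuses to let you forget the None case.
fn first_word(s: &str) -> Option<&str> {
    s.split_whitespace().next()
}

fn main() {
    match first_word("   ") {
        Some(w) => println!("first word: {w}"),
        None => println!("no words"), // deleting this arm is a compile error
    }
}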
Those are the small properties I am talking about.
Some others could include:
- pure: a subprogram does not modify any of its memory after initialization and has no side effects.
- total: a subprogram is guaranteed to terminate.
- no_alloc: a subprogram does not allocate memory.
- no_panic: a subprogram does not panic/crash.
Proofs
Now, how does this solve the problem of specification difficulty?
Well, I’ve very carefully added “run tools on” every time I talk about looking at a piece of code. This is an essential bit. Every property, in order to be proven, must have some proof. In all cases, I think that proof should come from a tool.
I can imagine you rolling your eyes. “Great, Gavin. That just turns formal verification into the creation of a million finicky tools.”
Not quite, but to explain, I need to back up.
See, there is one problem I have not mentioned: the specification must match the code.
Yet my solution handles both of the remaining problems: specification mismatch and specification difficulty.
The only way to make sure the specification matches the code is to prove the property about the generated code, and the best place to do that is the thing that actually generates that code: the compiler. Yes, this means that the language has to have this type of formal verification built in. Nevertheless, if the properties are proved there, in the compiler, on the generated code, then the code will match the specification, barring compiler bugs.
And quite frankly, the compiler should be written in its own language with as many small properties proven about it as possible.
Thus, the required “tools” will merely be compiler passes.
Which means that the compiler needs to be able to run custom passes, but that’s easy.
I think the passes should be run on the compiler’s intermediate representation (IR), the internal data format designed for easy processing. This will make them easier to write.
Yes, this means that, technically, the final generated code will not be proven, but that could be done by completely and formally verifying the compiler’s IR-to-machine code subprogram. That’s a much easier gap to bridge than doing the same thing for every individual program.
And some of them are easy:
- no_null: Check that every type is not nullable.
- no_panic: Require no_null, and check that panic() and all equivalents are not used.
- no_alloc: Check that all allocation routines are never used.
- Etc.
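To show the shape of such a pass, here is a toy no_panic checker in Rust over an invented, heavily simplified IR; none of these types are real compiler internals:

use std::collections::{HashMap, HashSet};

// An invented toy IR; a real compiler IR is far richer.
enum Inst {
    Call(String), // call another function by name
    Panic,        // a panic/abort site
    Other,        // everything else
}

struct Function {
    body: Vec<Inst>,
}

// Returns the set of functions proven no_panic. A function qualifies
// if its body contains no Panic instruction and calls only functions
// already proven no_panic; iterating to a fixed point propagates the
// property up the call graph. (Recursion is conservatively rejected.)
fn no_panic_pass(program: &HashMap<String, Function>) -> HashSet<String> {
    let mut proven: HashSet<String> = HashSet::new();
    loop {
        let mut changed = false;
        for (name, func) in program {
            if proven.contains(name) {
                continue;
            }
            let ok = func.body.iter().all(|inst| match inst {
                Inst::Panic => false,
                Inst::Call(callee) => proven.contains(callee),
                Inst::Other => true,
            });
            if ok {
                proven.insert(name.clone());
                changed = true;
            }
        }
        if !changed {
            return proven;
        }
    }
}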
Language
Of course, if the compiler needs to be involved, this is not possible without direct support from the language as well. And I am making that language.
That language is called Yao, and this is the list of things it has to do:
- Guarantee that destructors will run.
- Use Dynamic Restricted Structured Concurrency.
- Support adding properties to function declarations.
- Ensure that compiler passes for verification are run on all such marked functions.
- Ensure that properties are restrictive, not liberating.
I need to expand on that last item because you may say, “But Gavin! Properties will infect the call stack and will require a lot of work to satisfy the compiler!”
No, they won’t. Well, they will, but they infect the call stack downward, which means they infect callees, not callers.
“That doesn’t make a difference, Gavin.”
It makes all the difference.
See, a function “controls” who it calls, but it does not control who calls it. In other words, if you write the function, you can easily ensure you only call other no_panic functions. However, assuming that properties were liberating, and we had panic instead of no_panic, you could not easily ensure that only other panic functions call your panic function. This means that it is easier to add functions with these restrictive properties than to add functions with liberating properties.
What are liberating properties? They are properties that make a function more powerful, not less.
The canonical example is probably async.
If I am right, then yes, this means that async should have been the default, and sync should have been the non-default function attribute. However, the blocking problem of calling a sync function in an async one would still remain.
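You can watch the caller infection happen in plain Rust:

// async is liberating: it grants fetch the power to await. The cost
// lands on callers: to call fetch, you must be async yourself or
// drag in an executor to block on it.
async fn fetch() -> u32 {
    42
}

async fn async_caller() -> u32 {
    fetch().await // fine: the caller is also async
}

fn sync_caller() -> u32 {
    // fetch().await // compile error: await outside an async context
    todo!("needs an executor, e.g., some block_on, to bridge the gap")
}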
I hate async.
However, there are others; just reverse all of the properties I have used as examples.
Examples
So what would using properties look like when using Yao?
Rust Panic
Let’s start with no_panic, since it was ultimately a panic in Rust that took down Cloudflare. Plus, it happened in a config parser. I can think of many ways to make such code never panic, and I’m not the brightest bulb.
It was updating the config, so they could have just fallen back to the previous version of the config with a log message.
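Here is a sketch of that fallback in Rust, with hypothetical Config and parse_config stand-ins; this is not Cloudflare’s actual code:

use std::path::Path;

struct Config { /* fields elided */ }

// Hypothetical parser; a real one would validate the file contents.
fn parse_config(path: &Path) -> Result<Config, String> {
    let _text = std::fs::read_to_string(path).map_err(|e| e.to_string())?;
    Err("unimplemented: validate the text and build the Config".to_string())
}

// On a bad new config, log and keep the old one. No path here panics.
fn update_config(path: &Path, current: Config) -> Config {
    match parse_config(path) {
        Ok(new_config) => new_config,
        Err(e) => {
            eprintln!("config update failed, keeping previous config: {e}");
            current
        }
    }
}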
The config parsing subprogram could have been defined like this:
no_panic
fn parse_config(f: Path) -> Config;
Easy.
And if you want to make sure your entire codebase is free from panics (minus OS shenanigans), just wrap everything in no_panic and do the incredible work of removing all panics.
But note again that this work is not viral; it is only the callees that need to change.
Java Exceptions
This is an especially interesting one, and it will make the restriction requirement clearer.
However, I think that there is another canonical example, which most people don’t think about: Java checked exceptions.
See, Java checked exceptions are liberating properties: by default, a method can throw no checked exceptions, and each exception listed in its throws clause grants it the power to throw one more.
What if methods could throw any exception, and the exceptions list removed exceptions from the list of possible exceptions?
“That would just mean that the whole program would need to be changed, Gavin.”
Au contraire! If every method can throw anything, then not a single method needs to be changed. We know this because it already exists: Java runtime exceptions. Yep! Any method can throw runtime exceptions, but you don’t have to go through your source code labeling every method.
Imagine that, instead of allowing methods to add exceptions they could throw, they had a no_throw clause that added exceptions that they could not throw. Any other method could call those methods without worry, but there would still be a check to ensure that only a certain list of exceptions could be thrown.
Implicit in this is that the methods could not throw any exception that is not imported and that is not thrown by callees.
Would this solve the problem? I think so. After all, if “the root of all evil” is “enforced handling,” then allowing all methods to throw means that enforced handling disappears.
That same author also claims that Java checked exceptions “expos[e] implementation details in the signature.” Well, if exceptions were restrictive, then the no_throw clause would expose the opposite: what exceptions are not thrown.
Oh, and if this were the case, then Java would have no need of a checked/unchecked dichotomy; all would be unchecked.
Lock Poisoning
Let’s take another example from Rust: lock poisoning.
Actually, for this example, I need to add a feature to Yao: types that require certain properties during their existence. I don’t think this would be too hard to add to the language, but it would be very useful, as we will see.
Specifically, this would be an extension of Design by Contract, a form of which Yao will have. A better form.
Let’s take a (hypothetical) type and make it require no_panic:
requires(no_panic)
MutexGuard#(T: type) -> type
{
// ...
}
Okay, I admit that this is no hypothetical type; it exists in Rust, and it is returned when a mutex is locked. Its lifetime defines how long the mutex is locked; when it is destroyed, the mutex is unlocked.
It also has a problem: lock poisoning.
Lock poisoning happens when a panic happens in a critical section, i.e., when a mutex is locked and when a MutexGuard<T> exists.
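You can trigger poisoning in ordinary Rust:

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let data = Arc::new(Mutex::new(0));
    let data2 = Arc::clone(&data);

    // Panic while holding the lock; the MutexGuard is dropped during
    // unwinding, and the mutex is marked poisoned.
    let _ = thread::spawn(move || {
        let _guard = data2.lock().unwrap();
        panic!("panicked inside the critical section");
    })
    .join();

    // Every later lock() now returns Err(PoisonError).
    assert!(data.lock().is_err());
}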
There are those who defend it, but I am of the opinion that “poisoning is a bad solution to a very real problem.”
So…what if Rust could simply say that a MutexGuard<T> can only exist in code that cannot panic? The “very real problem” simply vanishes. Just like that. And the compiler guarantees that the problem does not exist!
Such is the power of restrictive properties.
Conclusion
With all of that said, do I think the world will move to using them? Oh, absolutely not! People aren’t willing to pay my salary to maintain a critical project, one used by at least one big company, so they sure as heck will not pay my salary to create a new language.
But if they did, I’d be willing. And it would be the greatest advancement to formal verification use in industry.