Retry in Distributed Systems — How Production Systems Recover From Temporary Failures (opens in new tab)
Not every failure is permanent. This is something I didn't think about before. When something fails in my app, my first thought was something broke, fix it. But when I started learning how distributed systems actually work, I realized that some failures are not really failures. They're just temporary. Network glitch. API timeout. A service that just restarted. Rate limiting kicking in. These are all failures but they last for a very short time window. If your system tries the same operation a...
Read the original article