(Originally posted October 30, 2018)
A common pattern used when communicating with external services is to retry a call that fails. Stripped of bells and whistles, the typical retry loop looks like this:
```python
MAX_RETRIES = 2  # for example

result = make_service_call(parameters)
num_retries = 0
while not result.success and num_retries < MAX_RETRIES:
    # insert delay before retry
    num_retries += 1
    result = make_service_call(parameters)

if not result.success:
    raise RuntimeError("service call failed after retries")
```
We can quibble about the specific implementation, but the general idea is the same: make multiple attempts to get a success response from the service, inserting a delay between calls. The delay can be a fixed amount of time, an exponential backoff with jitter, etc., and you can include all kinds of other logic to improve things, but it all boils down to essentially the same thing.
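To make one of those delay strategies concrete, here is a minimal sketch of capped exponential backoff with full jitter; the base delay, the cap, and the function name are my own illustrative choices, not part of the pattern itself:

```python
import random

BASE_DELAY = 0.1  # seconds; illustrative starting delay
MAX_DELAY = 5.0   # seconds; cap so the delay doesn't grow without bound

def backoff_delay(attempt):
    """Return a random delay between 0 and min(MAX_DELAY, BASE_DELAY * 2**attempt)."""
    ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0, ceiling)

# In the loop above, "# insert delay before retry" would become:
#     time.sleep(backoff_delay(num_retries))
```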
Under normal circumstances, of course, service calls don’t fail, and this logic works very well. But if the service call failure rate increases, some very bad things can happen.
Imagine a critical service for which you have a contract (a service level agreement, or SLA) that guarantees an average 10 ms response time for up to 1,000 transactions per second (TPS). Remember, though, that this is a shared service. There are many other clients who have similar contracts: 10 ms response time for X TPS. Your application calls that service perhaps 900 times per second, on average. There will be brief periods when your call rate will exceed 1,000 TPS, but that’s okay because the service is scaled to handle large amounts of traffic from many clients. Say the service can guarantee that 10 ms response time for a total of 1,000,000 TPS from all of its clients, combined. Short-term bursts of excess traffic from a few clients aren’t a problem.
Even if calls to the service exceed 1,000,000 TPS, the likely result at first will be increased response time: perhaps latency increases by 50% with a sustained traffic increase of 10%, and doubles when traffic is 20% above the configured maximum. The specific breaking point differs for every service, but in general latency increases non-linearly with the call rate.
Clients, of course, won’t wait forever for a response. They typically configure a timeout (often two or three times the SLA) and consider the call a failure if it takes longer than that. With the retry logic above, that’s not a problem: just delay a bit and try again.
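As an illustration, the make_service_call function used in the loop above might look something like this for an HTTP service, assuming the requests library; the URL, the Result class, and the 30 ms timeout (roughly three times the 10 ms SLA) are placeholders of my own, not anything from a real service:

```python
import requests

SERVICE_URL = "https://service.example.invalid/api"  # placeholder endpoint
TIMEOUT_SECONDS = 0.030  # roughly three times the 10 ms SLA

class Result:
    def __init__(self, success, payload=None):
        self.success = success
        self.payload = payload

def make_service_call(parameters):
    try:
        response = requests.get(SERVICE_URL, params=parameters, timeout=TIMEOUT_SECONDS)
        response.raise_for_status()
        return Result(True, response.json())
    except requests.RequestException:
        # A timeout, transport error, or HTTP error status counts as a failure.
        return Result(False)
```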
As I said above, this kind of thing works fine under normal conditions. But in a large system, lots of things can go wrong.
Imagine what would happen if the service starts getting 1,500,000 requests per second: a sustained 50% increase in traffic. Or one of the service’s dependencies can’t meet its SLA. Or network congestion increases the error rate. Whatever the cause of the service’s distress, its failure rate climbs, or its latency grows beyond the timeout value set by clients, and your application blindly responds by sending another request. So if your MAX_RETRIES value is two, you’ve effectively tripled the number of calls you make to the service.
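Continuing the hypothetical numbers from above, the arithmetic is straightforward:

```python
NORMAL_RATE = 900   # this client's normal call rate, in TPS (from the example above)
MAX_RETRIES = 2

# If every call fails and exhausts its retries, each logical request
# becomes 1 + MAX_RETRIES calls on the wire.
amplified_rate = NORMAL_RATE * (1 + MAX_RETRIES)
print(amplified_rate)   # 2700 TPS from this one client, aimed at a service already in trouble
```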
The last thing a service under distress needs is more requests. Even if your application is not experiencing increased traffic, your retries still have a negative effect on the service.
Some argue that services should protect themselves from such traffic storms. And many do. But that protection isn’t free. There comes a point when the service is spending so much time telling clients to go away that it can’t spend any time clearing its queue. Not that clearing the queue is much help. Even after the initial problem is fixed, the service is swamped with requests from all those clients who keep blindly retrying. It’s a positive feedback loop that won’t clear until the clients stop calling.
The retry loop above might improve your application’s reliability in normal operation. I say “might” because most applications I’ve encountered don’t actually track the number of retries, so they have no idea if the retry logic even works. I’ve seen the following in production code:
- A retry loop that always made the maximum number of retries, even if the initial call succeeded.
- Retry logic that never retried. That code was in production for two years before anybody realized there was a problem. Why? Because the service had never failed before.
- Retry logic that actually did the retries but then returned the result of the first call.
- Infinite retry. When a non-critical service went down one day, the entire site became inoperable.
As bad as it is that many programmers apparently don’t test their retry logic, it’s worse that even fewer monitor it. Of all the applications I’ve seen with retry logic, only a handful of teams could tell me how effective it is. If you want to know whether your retry logic is working, you have to log:
- The number of initial calls to the service.
- The number of initial call failures.
- The total number of calls to the service (including retries).
- The number of call successes (including success after retry).
From those numbers, you can determine the effectiveness of the retry logic. In my experience, the percentage of initial call failures to any service under normal operation is less than 1%, and retry succeeds in fewer than 50% of those cases. When a service is under distress and the initial failure percentage gets above about 10%, retry is almost never successful. The reason, I think, is that whatever condition caused the outage hasn’t cleared before the last retry: service outages last longer than clients are willing to wait.
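The bookkeeping for those four numbers, and the metrics derived from them, is small. Here is a minimal sketch; the RetryStats class and its field names are my own, not from any particular metrics library:

```python
from dataclasses import dataclass

@dataclass
class RetryStats:
    initial_calls: int = 0       # initial calls to the service
    initial_failures: int = 0    # initial calls that failed
    total_calls: int = 0         # all calls, including retries
    successes: int = 0           # calls that eventually succeeded, with or without retry

    def record(self, attempts, succeeded):
        """Record one logical call that took `attempts` tries and ended in `succeeded`."""
        self.initial_calls += 1
        self.total_calls += attempts
        if attempts > 1 or not succeeded:
            self.initial_failures += 1   # the first attempt failed
        if succeeded:
            self.successes += 1

    def report(self):
        initial_failure_rate = self.initial_failures / self.initial_calls
        # Of the calls whose first attempt failed, how many did a retry rescue?
        rescued = self.successes - (self.initial_calls - self.initial_failures)
        retry_success_rate = rescued / self.initial_failures if self.initial_failures else 0.0
        return initial_failure_rate, retry_success_rate
```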
For the majority of applications I’ve encountered, retry is rarely worth the effort it takes to design, implement, debug, test, and monitor. Under normal circumstances it’s almost irrelevant, maybe making the difference between 99% and 99.5% success rate. In unusual circumstances, it increases the load on the underlying service, and almost never results in a successful call. It’s a small win where it doesn’t matter, and a huge liability when it does matter.
If you have existing retry logic in your code, I encourage you to monitor its effectiveness. If, like me, you discover that it rarely provides benefit, I suggest you remove it.
If you’re considering adding retry logic to your code, be sure to consider the potential consequences. And add the monitoring up front.