Re-thinking the retry

(Originally posted October 30, 2018)

A common pattern used when communicating with external services is to retry a call that fails. Stripped of bells and whistles, the typical retry loop looks like this:

    result = makeServiceCall(parameters)
    numRetries = 0
    while !result.success && numRetries < MAX_RETRIES
        // insert delay before retry
        ++numRetries
        result = makeServiceCall(parameters)
    end while

    if !result.success
        // Raise error.
    end if

We can quibble about the specific implementation, but the general idea is the same: make multiple attempts to get a success response from the service, inserting a delay between calls. The delay can be a fixed amount of time, an exponential backoff with jitter, etc., and you can include all kinds of other logic to improve things, but it all boils down to essentially the same thing.
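For readers who want something runnable, here is a minimal Python sketch of the loop above with exponential backoff and full jitter. It is one possible implementation, not the only one. The names `call_with_retry`, `make_call`, `base_delay`, and `max_delay` are my own inventions for illustration, and `make_call` is assumed to return an object with a boolean `success` attribute, like the pseudocode's `result`.

```python
import random
import time


def call_with_retry(make_call, max_retries=2, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep, rng=random.random):
    """Call make_call() until it succeeds or max_retries retries are used.

    make_call is a hypothetical zero-argument function that returns an
    object with a boolean .success attribute.
    Returns (result, attempts), where attempts counts every call made.
    """
    result = make_call()
    attempts = 1
    while not result.success and attempts <= max_retries:
        # Exponential backoff with full jitter: wait a random time between
        # 0 and min(max_delay, base_delay * 2**(retries so far)).
        cap = min(max_delay, base_delay * 2 ** (attempts - 1))
        sleep(rng() * cap)
        result = make_call()
        attempts += 1
    return result, attempts
```

The `sleep` and `rng` parameters exist only so that the delay can be stubbed out in tests. Returning the attempt count alongside the result makes it easy to log how many calls each request actually cost.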

Under normal circumstances, of course, service calls don’t fail, and this logic works very well. But if the service call failure rate increases, some very bad things can happen.

Imagine a critical service for which you have a contract (a service level agreement, or SLA) that guarantees an average 10 ms response time for up to 1,000 transactions per second (TPS). Remember, though, that this is a shared service. There are many other clients who have similar contracts: 10 ms response time for X TPS. Your application calls that service perhaps 900 times per second, on average. There will be brief periods when your call rate will exceed 1,000 TPS, but that’s okay because the service is scaled to handle large amounts of traffic from many clients. Say the service can guarantee that 10 ms response time for a total of 1,000,000 TPS from all of its clients, combined. Short-term bursts of excess traffic from a few clients aren’t a problem.

Even if calls to the service exceed 1,000,000 TPS, the likely result at first will be increased response time: perhaps latency increases by 50% with a sustained traffic increase of 10%, and doubles when traffic is 20% above the configured maximum. The specific breaking point differs for every service, but in general latency increases non-linearly with the call rate.

Clients, of course, won’t wait forever for a response. They typically configure a timeout (often two or three times the SLA), and consider the call a failure if it takes longer than that. Not a problem with this retry logic: just delay a bit and try again.

As I said above, this kind of thing works fine under normal conditions. But in a large system, lots of things can go wrong.

Imagine what would happen if the service starts getting 1,500,000 requests per second: a sustained 50% increase in traffic. Or one of the service’s dependencies can’t meet its SLA. Or network congestion increases the error rate. Whatever the cause, the service’s failure rate increases, or latency increases beyond the timeout value set by clients, and your application blindly responds by sending another request. So if your MAX_RETRIES value is two and every attempt fails, you’ve effectively tripled the number of calls you make to the service.
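To put rough numbers on that amplification, here is a small back-of-the-envelope sketch. It assumes each attempt fails independently with the same probability, which is a simplification of mine: a service in distress tends to fail in bursts, so real amplification is likely worse. The function name `expected_calls` is hypothetical.

```python
def expected_calls(failure_rate, max_retries):
    """Expected number of calls per request.

    Assumes every attempt fails independently with probability
    failure_rate. Attempt k (counting from 0) happens only if the k
    attempts before it all failed, which has probability failure_rate**k.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


for p in (0.01, 0.10, 0.50, 1.00):
    print(f"failure rate {p:.0%}: {expected_calls(p, 2):.4f} calls per request")
```

With two retries, a 1% failure rate costs about 1.01 calls per request, which nobody will notice. A 50% failure rate costs 1.75 calls per request, and when every call fails you are at the full three.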

The last thing a service under distress needs is more requests. Even if your application is not experiencing increased traffic, your retries still have a negative effect on the service.

Some argue that services should protect themselves from such traffic storms. And many do. But that protection isn’t free. There comes a point when the service is spending so much time telling clients to go away that it can’t spend any time clearing its queue. Not that clearing the queue is much help. Even after the initial problem is fixed, the service is swamped with requests from all those clients who keep blindly retrying. It’s a positive feedback loop that won’t clear until the clients stop calling.

The retry loop above might improve your application’s reliability in normal operation. I say “might” because most applications I’ve encountered don’t actually track the number of retries, so they have no idea if the retry logic even works. I’ve seen the following in production code:

  1. A retry loop that always made the maximum number of retries, even if the initial call succeeded.
  2. Retry logic that never retried. That code was in production for two years before anybody realized there was a problem. Why? Because the service had never failed before.
  3. Retry logic that actually did the retries but then returned the result of the first call.
  4. Infinite retry. When a non-critical service went down one day, the entire site became inoperable.
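Each of those four bugs would likely have been caught by a test that drives the retry code with a scripted fake service. The sketch below is one way to do that. Everything in it is hypothetical: `ScriptedService`, `Result`, and `check_retry` are names I made up, and it assumes your retry code can be called as `retry_fn(service, max_retries)`, returns the final result, and has a delay that can be disabled in tests.

```python
class Result:
    def __init__(self, success, call_number):
        self.success = success
        self.call_number = call_number


class ScriptedService:
    """Fake service that plays back a scripted list of outcomes.

    Once the script runs out, the last outcome repeats. More than `limit`
    calls raises, so an infinite retry loop fails instead of hanging.
    """

    def __init__(self, outcomes, limit=100):
        self.outcomes = list(outcomes)
        self.limit = limit
        self.calls = 0

    def __call__(self):
        if self.calls >= self.limit:
            raise AssertionError("too many calls: runaway retry loop?")
        ok = self.outcomes[min(self.calls, len(self.outcomes) - 1)]
        self.calls += 1
        return Result(ok, self.calls)


def check_retry(retry_fn, max_retries):
    """Raise AssertionError if retry_fn misbehaves. Needs max_retries >= 1."""
    # Bug 1: a successful first call must not be retried.
    service = ScriptedService([True])
    assert retry_fn(service, max_retries).success
    assert service.calls == 1

    # Bugs 2 and 3: a failure followed by a success must be retried, and
    # the result of the successful call must be the one returned.
    service = ScriptedService([False, True])
    result = retry_fn(service, max_retries)
    assert service.calls == 2
    assert result.success and result.call_number == 2

    # Bug 4: a service that never recovers must be abandoned after
    # max_retries retries, which is max_retries + 1 calls in total.
    service = ScriptedService([False])
    assert not retry_fn(service, max_retries).success
    assert service.calls == max_retries + 1
```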

As bad as it is that many programmers apparently don’t test their retry logic, even fewer monitor it. In all the applications I’ve seen with retry logic, only a handful can tell me how effective it is. If you want to know whether your retry logic is working, you have to log:

  • The number of initial calls to the service.
  • The number of initial call failures.
  • The total number of calls to the service (including retries).
  • The number of call successes (including success after retry).

From those numbers, you can determine the effectiveness of the retry logic. In my experience, the percentage of initial call failures to any service under normal operation is less than 1%, and retry succeeds in fewer than 50% of those cases. When a service is under distress and the initial failure percentage gets above about 10%, retry is almost never successful. The reason, I think, is that whatever condition caused the outage hasn’t cleared before the last retry: service outages last longer than clients are willing to wait.
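As a rough illustration of how those four counters turn into an effectiveness number, here is a minimal in-memory sketch. The class and method names are my own, and a real application would send these counts to whatever metrics system it already uses rather than keep them in an object. It also assumes that a retry happens only after a failed attempt, so any request that took more than one call had an initial failure.

```python
class RetryStats:
    def __init__(self):
        self.initial_calls = 0     # logical requests (one initial call each)
        self.initial_failures = 0  # requests whose first attempt failed
        self.total_calls = 0       # every attempt, including retries
        self.successes = 0         # requests that eventually succeeded

    def record(self, attempts, succeeded):
        """Record one logical request that took `attempts` calls."""
        self.initial_calls += 1
        self.total_calls += attempts
        if attempts > 1 or not succeeded:
            self.initial_failures += 1
        if succeeded:
            self.successes += 1

    def initial_failure_rate(self):
        if not self.initial_calls:
            return 0.0
        return self.initial_failures / self.initial_calls

    def retry_success_rate(self):
        """Fraction of initially failed requests that a retry rescued."""
        if not self.initial_failures:
            return 0.0
        failed_for_good = self.initial_calls - self.successes
        rescued = self.initial_failures - failed_for_good
        return rescued / self.initial_failures

    def call_amplification(self):
        """Average number of calls made per logical request."""
        if not self.initial_calls:
            return 0.0
        return self.total_calls / self.initial_calls
```

For example, 98 requests that succeed immediately, one that succeeds on its first retry, and one that fails all three attempts give an initial failure rate of 2%, a retry success rate of 50%, and 1.03 calls per request.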

For the majority of applications I’ve encountered, retry is rarely worth the effort it takes to design, implement, debug, test, and monitor. Under normal circumstances it’s almost irrelevant, maybe making the difference between 99% and 99.5% success rate. In unusual circumstances, it increases the load on the underlying service, and almost never results in a successful call. It’s a small win where it doesn’t matter, and a huge liability when it does matter.

If you have existing retry logic in your code, I encourage you to monitor its effectiveness. If, like me, you discover that it rarely provides benefit, I suggest you remove it.

If you’re considering adding retry logic to your code, be sure to consider the potential consequences. And add the monitoring up front.

Rock Me Baby, ’til My Feet Come Off

Very early in our relationship, Debra and I got silly one night and were making fun of country music. I don’t know what prompted it, but at some point I came up with a song title: “Rock Me Baby, ’til My Feet Come Off.” We couldn’t stop laughing. I know, it was one of those “you had to be there” moments.

That was 30 years ago and I still think that somebody should come up with lyrics. It could be a hit!

I’ve been poking at Microsoft’s Bing Chat (powered by ChatGPT) and Google’s Bard recently, so I thought I’d see what they can do with that title. My prompt to each one was: write lyrics for a country song called “Rock Me Baby, ’til My Feet Come Off”.

Bing Chat

Here’s a country song called “Rock Me Baby, ’til My Feet Come Off”:

Verse 1:
I’ve been working all day long
My feet are tired and my back is sore
I need someone to take me home
And rock me baby, ’til my feet come off

Chorus:
Rock me baby, ’til my feet come off
Hold me tight and never let me go
Rock me baby, ’til my feet come off
And I’ll love you more than you’ll ever know

Verse 2:
I’ve been waiting for this moment
To hold you close and feel your love
I need someone to take me home
And rock me baby, ’til my feet come off

Chorus:
Rock me baby, ’til my feet come off
Hold me tight and never let me go
Rock me baby, ’til my feet come off
And I’ll love you more than you’ll ever know

Bridge:
I don’t need no fancy things
Just your love is all I need
So come on baby, take my hand
And rock me baby, ’til my feet come off

Chorus:
Rock me baby, ’til my feet come off
Hold me tight and never let me go
Rock me baby, ’til my feet come off
And I’ll love you more than you’ll ever know

Outro:
Rock me baby, ’til my feet come off
Rock me baby, ’til my feet come off.

Google Bard


(Verse 1) Rock me baby, ’til my feet come off
I’m so tired, I can’t stand no more
I’ve been working all day, and I’m beat
Just rock me baby, and make me feel complete

(Chorus) Rock me baby, ’til my feet come off
I’m so tired, I can’t sleep at night
I just lay there and toss and turn
And dream of being rocked in your arms

(Verse 2) Rock me baby, ’til my troubles are gone
I’m so worried, I can’t catch my breath
I’ve got a lot on my mind, and I don’t know what to do
Just rock me baby, and make all my troubles new

(Chorus) Rock me baby, ’til my feet come off
I’m so tired, I can’t sleep at night
I just lay there and toss and turn
And dream of being rocked in your arms

(Bridge) Rock me baby, ’til I can’t feel my bones
I’m so worn out, I’m just about gone
But I know that if you rock me long enough
I’ll be all better, and I’ll be able to go on

(Chorus) Rock me baby, ’til my feet come off
I’m so tired, I can’t sleep at night
I just lay there and toss and turn
And dream of being rocked in your arms

(Outro) Rock me baby, ’til the morning light
I’m so tired, I can’t see the night
Just rock me baby, and make me feel all right
I’ll be your baby, and you’ll be my knight

Well of course they’re terrible. But then, it’s country music.

I don’t think music writers have anything to worry about quite yet . . .

No, AI won’t take your job

I can’t tell how serious writers and actors are about their concerns with AI. They’re asking for some unspecified safeguards on the use of AI, or regulations, or something. In general, some language to assure them that their jobs will not be taken over by “AI.”

I think it’s ridiculous, but perhaps I’m attributing to the writers and actors things that have been overblown by the media or by the general public’s hysterical reaction to anything that somebody calls “AI.” Or algorithms in general. As far as all too many people are concerned, any “algorithm” is automatically evil and out to do us harm.

I base my ridicule on three things. First, people have been protesting new technology since the dawn of new technology. Two hundred years ago, the original Luddites destroyed equipment in textile mills in protest of automation, but they weren’t the first to protest automation. Strangely enough, the machines didn’t put them out of work. And yet protests against automation were common throughout the industrial revolution and continue to this day. Computers, for example, were going to put armies of clerical workers out of a job. But now, 70 years into the computer revolution, there are more clerical jobs than ever. There are cases in which automation has made certain jobs irrelevant, but it doesn’t happen overnight. And there continues to be a need for the replaced skill for some time.

Second, the idea of artificial intelligence replacing a journalist, screenwriter, actor, programmer, or any other skilled human is laughable. As I’ve mentioned before, ChatGPT (which I think is what has gotten everybody up in arms) and similar tools are just mimics: they rearrange words in a blender and spit them out semi-randomly, following rules that dictate the form, but with no regard to substance. And that’s just regurgitating stuff that’s already known. Attempts at AI creativity–having the computer create something novel–are comically terrible. The idea of a generative AI replacing a human writer just isn’t realistic. Certainly not within my lifetime, and likely not within the lifetime of anybody born today.

Third, if somebody does develop an AI that can produce objectively better news stories, movie scripts, novels, acting performances, computer programs, etc. than a human, then more power to them! As long as I’m informed or entertained, I don’t particularly care who or what created the article or the performance. We all benefit from better expression of ideas, and those whose skills are better performed by artificial intelligence will either find something else to do that is not yet AI-dominated, or will be able to peddle their skills in a smaller and often more lucrative market. For certain, any actor who’s pushed out of the big studios by this future fanciful AI will have plenty of opportunity in smaller studios that can’t afford or don’t want to use AI technologies.

Yes, there is some justifiable concern that studios will use currently available techniques, and new techniques going forward, to unscrupulously manipulate an actor’s voice, image, or performance to say things that the actor never intended or agreed to. We’ve all seen those agreements that allow the companies to use our likeness in any way, shape, or form, in perpetuity. Those types of clauses should have been eliminated from contracts decades ago, and I support those who are trying to address that situation now. But beyond that, the fears about AI replacing skilled workers, especially skilled creatives, are unfounded.

Credit scoring is a scam

I do a quick check of my credit score about once a month. I’m not sure exactly why I do that, considering that I’m comfortable financially, my life is pretty much the same as it’s been for some time, and I don’t foresee any major changes.

I have one credit card, and a five-year auto loan that has about a year and a half left. That is the extent of my outstanding credit. My monthly purchases on the credit card fall within a very narrow range and I pay it off every month. The auto loan is current and I’ve never been late with a payment.

Given the above, one would expect my credit score to remain the same, with a few minor fluctuations on a month-to-month basis depending on how much I charge on my credit card. And it does, mostly. But in the last month my credit score dropped 32 points (almost 4%), and I have no idea why. There have been no recent inquiries, no new accounts, no missed or late payments, and the outstanding credit card balance is within the normal range of what I charge on a monthly basis. In short, nothing has changed.

At least, nothing on my side of the equation has changed. Whether something has changed in the way the Vantage credit score is computed is an open question. It’s an open question because only the people computing the credit score know how it works. We’re being graded on hidden criteria. The scoring system is proprietary. Most of us who are affected by it have no way of knowing how our scores are computed or how our individual financial decisions will affect that score. And yet, credit score is a huge determining factor in one’s ability to get a good rate on a house or car loan–or to obtain a loan at all.

Sure, we know some general rules: don’t have “too much” available credit or use “too much” of it. Don’t have “too many” inquiries or open “too many” new accounts. But be sure to have “enough” available credit and be sure to use “the right percentage” of it to show that you know how to handle the responsibility. And be sure to have a first mortgage because that shows your ability and willingness to handle long-term credit commitments.

You paid off your home mortgage? Shame on you! We’re gonna subtract points because “you don’t have enough experience with a first mortgage.” That’s right, paying that 30-year loan off in 20 years is a bad thing.

There are general guidelines (Vantage, for example, says that I should have at least $50,000 in available credit and using up to 10% of it is “Excellent”, which is ridiculous), but there’s no hard detail that says exactly what the grading criteria are and how they interact. As far as consumers are concerned, a bunch of numbers go into a black box and the MagicScore™ comes out. We don’t get to know how it all works, and apparently we’re not even told if the innards change and suddenly we’re being graded completely differently.

In a just world, we’d at least know how the score used to decide whether we get a loan, and at what rate, is computed. I realize that credit score isn’t the only determining factor, but it can be a deal killer (i.e. everything else looks good, but the MagicScore™ is off).

I’ve long held that credit scoring, as currently practiced in the United States, is essentially arbitrary. Incidents like my score’s recent drop, with no indication of why it dropped, strengthen that perception. The companies that market the scoring systems (Vantage is a joint effort of the three major credit bureaus, FICO is a product of the Fair Isaac Corporation) and the financial institutions that use them seem to have absolutely no desire or incentive to tell us how the scoring works. And consumers have nowhere else to go: buy into the credit score scam or go visit Guido.