Fun with the piggy bank

Debra and I have a jar in which we put our spare change. When we fill it, about once every two years or so, we take the jar to the bank, where they pour the contents into their coin counter/sorter. We take the cash and splurge on something. It’s not a huge amount of money: usually around $200.

When the jar fills up, I like to dump it on the living room floor and count the money. Sifting through those coins brings back memories of my grandmother, who was a coin collector. I’m also reminded of watching the pinball machine service guys coming to Dad’s bowling center, of the brief period when I worked for my dad at the vending company, and of emptying the coins from the vending machines we owned in our motels. Yeah, sifting through coins is a happy trip down memory lane for me.

While I was counting today, I got to thinking about the expected percentages of the different coins, and whether I could accurately estimate the amount of money in a change jar just by counting a single type of coin. So if I had, say, $100 in quarters, then how much money should I expect to be in the jar? How would I go about estimating that?

First, I have to make some assumptions.

  1. When I purchase something, I always get the optimum number of coins in change. That is, if I pay 75 cents for something, I’ll get a quarter back: never two dimes and a nickel, etc. This, in my experience, is a pretty safe assumption.
  2. I always pay for things in even dollar amounts. That is, if something costs $2.03, I give the cashier an even dollar amount and expect change. I never give, say, $5.03, and expect $3.00 back. I actually do that sometimes, but most often I don’t have any change in my pocket.
  3. I never take money out of the jar except for when I’m emptying it completely. That’s almost certainly true these days, because I don’t use the vending machines at work anymore. There’s no need for me to filch quarters.
  4. Each change amount is equally likely. That is, I’m just as likely to get 23 cents in change as I am to get 99 cents in change. This is probably a bad assumption. Cashiers often give me a quarter in change when the actual amount due is 24 cents. Rounding up one or two cents is common. Although this assumption doesn’t hold true, the percentage error it causes is likely small.

With those assumptions, how do I estimate how many coins of each type are in the jar?

The standard “counting up” method of making change is proven optimum. That is, if somebody pays $1.00 for an item that costs $0.33, then the change is two pennies, a nickel, a dime, and two quarters. If we write out the coins required to make change for every amount from $0.01 to $0.99, we end up with:

200 pennies (42.55% of coins)
40 nickels (8.51%)
80 dimes (17.02%)
150 quarters (31.91%)
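
Those numbers come straight from tallying greedy change for every amount; here’s a quick sketch in Python (the function and variable names are mine):

```python
from collections import Counter

def greedy_change(cents, denominations=(25, 10, 5, 1)):
    """Make change the way a cashier does: largest coins first."""
    coins = Counter()
    for d in denominations:
        coins[d], cents = divmod(cents, d)
    return coins

# Tally the coins needed for every change amount from 1 to 99 cents.
totals = Counter()
for amount in range(1, 100):
    totals.update(greedy_change(amount))

total_coins = sum(totals.values())  # 470
for d in (1, 5, 10, 25):
    print(f"{totals[d]} coins of {d} cents: {100 * totals[d] / total_coins:.2f}%")
```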

$100 is 400 quarters. Given 400 quarters, I would expect the jar to hold 400/0.3191, or about 1,253 coins, with this distribution:

533 pennies ($5.33)
106 nickels ($5.30)
213 dimes ($21.30)
400 quarters ($100.00)

(You’ll notice there’s a rounding error: the count of coins above is only 1,252.)

That gives me a total of $131.93. So I should be able to count the value of the quarters, add 30%, and have a pretty decent estimate of the amount of money in the jar.
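
In code, the back-of-the-envelope estimate looks like this (the fractions are taken from the tally above; the function name is mine):

```python
# Rough estimate of a jar's total value from the quarters alone,
# scaling the theoretical coin distribution to match the quarter count.
FRACTIONS = {1: 0.4255, 5: 0.0851, 10: 0.1702, 25: 0.3191}

def estimate_total(quarter_value_dollars):
    n_quarters = quarter_value_dollars / 0.25
    total_coins = n_quarters / FRACTIONS[25]
    other_coins = sum(total_coins * frac * denom / 100
                      for denom, frac in FRACTIONS.items()
                      if denom != 25)
    return other_coins + quarter_value_dollars

print(round(estimate_total(100.00), 2))  # ~132, close to the $131.93 above
```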

I just happened to write down the contents of our change jar:

840 pennies ($8.40) (38.85% of coins)
330 nickels ($16.50) (15.26%)
402 dimes ($40.20) (18.59%)
590 quarters ($147.50) (27.29%)

That’s 2,162 coins, with a total value of $212.60. If I add 30% to the value of the quarters, I get $191.75, which is about 10% below the actual amount of money in the jar. The percentages of dimes and pennies are pretty close. The jar is light on quarters, and heavy on nickels.

A 10% error by counting only 27% of the coins, though, isn’t bad.

If you have a change jar and are willing to play along, I’d like to know your results. You don’t have to give me actual dollar amounts: just tell me the percentage error, and whether it’s high or low.

Something else I just noticed. The theoretically perfect jar has 1,252 coins, with a total value of $131.93. Our jar had 2,162 coins with a total value of $212.60. In both cases, if you divide the number of coins by 10, you come very close to the total value in the jar. $125.20 is only about 5% shy of the $131.93 actual total. And $216.20 is only about 1.7% higher than the actual total of $212.60 in our jar. Freaky.


That’s not what negative feedback means

On January 30, NPR’s All Things Considered ran a segment called How The Trump Administration’s Tariffs On China Have Affected American Companies. In it, NPR’s Ari Shapiro was asking questions of Bloomberg reporter Andrew Mayeda. There was this exchange.

SHAPIRO: When you look at the scale of the impact, is this more along the lines of an annoyance and inconvenience, or is it a real economic impact, something that could lead to slower economic growth, maybe even a recession down the line? How severe is it?

MAYEDA: If you actually look at the big-picture forecasts of the impact – for example, the IMF says that if we have a worst-case trade scenario, the global economy is going to be less than 1 percent smaller than what it otherwise would’ve been. That is not catastrophic. I think what people are concerned about is that there’s some type of confidence shock. That is to say that, you know, businesses start investing less. Consumers start spending less. And it gets into this negative feedback loop where reducing confidence leads to slower growth.

Hate to tell you this, Mr. Mayeda, but what you describe here is a positive feedback loop.

“Negative feedback occurs when some function of the output of a system, process, or mechanism is fed back in a manner that tends to reduce the fluctuations in the output.” Furthermore:

Whereas positive feedback tends to lead to instability via exponential growth, oscillation or chaotic behavior, negative feedback generally promotes stability. Negative feedback tends to promote a settling to equilibrium, and reduces the effects of perturbations. Negative feedback loops in which just the right amount of correction is applied with optimum timing can be very stable, accurate, and responsive.

Sadly, this isn’t the only instance I’ve encountered recently of reporters misusing the term. In all cases, reporters are characterizing the feedback as negative or positive based on the outcome. To them, if the outcome is bad, then it’s a negative feedback loop. If the outcome is good, then it’s a positive feedback loop. That’s not the way it works.

Negative feedback loops typically lead to stable systems: generally a positive result. A positive feedback loop tends to result in a system fluctuating wildly or going completely out of control: a generally negative result.
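
A toy simulation makes the distinction concrete (the gain values here are arbitrary, purely for illustration):

```python
def feedback(gain, start=1.0, steps=20):
    """Feed a fraction of the output back into the system each step.
    A negative gain opposes the deviation (negative feedback);
    a positive gain reinforces it (positive feedback)."""
    x = start
    for _ in range(steps):
        x += gain * x
    return x

print(feedback(-0.5))  # negative feedback: settles toward zero
print(feedback(+0.5))  # positive feedback: blows up
```

Slower growth reducing confidence, which further slows growth, is the second case: the deviation reinforces itself.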

Reporters, please do a little research before using terms that you’re not familiar with. It’ll save you a lot of embarrassment, and prevent you from confusing people.

The Trump-McConnell shutdown

As the current government shutdown approaches a month, little has changed. Democrats blame President Trump for the impasse, Trump and his supporters like to blame Democrats even though the president himself said, “I will shut it down,” and pretty much everybody else blames general government dysfunction.

The truth, though, is that there’s a third party involved: Senate Majority Leader and oathbreaker extraordinaire, Mitch McConnell, who is once more derelict in his duty.

Our system of passing legislation is supposed to be pretty simple. For legislation that involves funding the government, the House passes a bill and sends it to the Senate. The Senate debates that bill and either passes or rejects it. If it’s rejected, typically there is a conference committee in which members of the House and Senate work out disagreements. Eventually, either the bill passes or is finally rejected. If the bill passes both houses of Congress then it goes to the president, who has two choices: 1) Sign the bill, making it law; 2) Veto the bill. In the case of #2, Congress can elect to override the president’s action by a two-thirds vote. If two-thirds of both houses of Congress vote for the bill after the president has vetoed it, the bill becomes law.

If Congress were to pass a spending bill and send it to the president, Donald Trump would have to take action: sign the bill or veto it. Neither of those actions would be beneficial to the president.

If President Trump were to veto the bill, then he would have to take full responsibility for the shutdown. He could no longer blame the situation on Democrats. It would be the same as standing up and saying, “I believe that it is more important to fund my wall than it is for government to function normally.” Plus, there’s a slight possibility that two-thirds of Congress would vote to override his veto, making him look like a fool. Many people think that ship has sailed already, but if Congress were to override his veto, even Trump would see himself as a fool. And weak.

If the president were to sign the bill, something I can’t see happening, all his bluster over the last month or so would look foolish. He’d be excoriated for “caving,” his detractors would ridicule him to no end, and his base would probably condemn him as a traitor.

In short, there is no way President Trump comes out looking good if Congress presents him with legislation that doesn’t fund his border wall.

Make no mistake, Trump painted himself into this corner. He made an ultimatum, fully expecting Congressional Democrats to cave. They didn’t, and now he’s in a tough spot. The only way he can win is if the Democrat-controlled House agrees to fund his wall, and there is almost no incentive for them to do so. As much as he tries, Trump can’t deflect responsibility for the shutdown that he instigated, and the longer it drags on, the more people blame the current situation on him.

So what does this have to do with Mitch McConnell? As Senate Majority Leader, Mitch McConnell has absolute control over what legislation is debated in the Senate. Nothing gets heard without his approval. And for reasons I cannot fathom, Mitch McConnell has made himself Donald Trump’s protector. In 2016, McConnell prevented the Senate from holding hearings on President Obama’s Supreme Court nominee. In doing so, McConnell violated his Oath of Office, which says, in part, “and that I will well and faithfully discharge the duties of the office on which I am about to enter.” McConnell very publicly refused to do his duty. There is no interpretation of that Oath that allows him to refuse just because it would be politically inconvenient.

In the current situation, McConnell knows that the Senate might just pass a bill that does not fund the wall, putting the president in a no-win situation. So he just doesn’t allow the Senate to debate the bill. I guess it doesn’t cost McConnell anything to do this: he’s already shredded his integrity. My only question here is why the rest of the senators don’t kick him to the curb. Allowing Mitch McConnell, a man who wouldn’t know integrity if it jumped up and slapped him across the face, to represent the United States Senate makes them all look bad.

I suppose I do have one other question: What power does Donald Trump hold over Mitch McConnell to make him act this way?

The president would have you believe that he’s fighting the good fight in the name of National Security. It’s all a smoke screen to hide the fact that he is vulnerable and fully dependent on an unscrupulous Senate Majority Leader. Trump’s supporters, even the few who know the truth of his powerlessness, eat it up and will continue to do so as long as he keeps up the bluster. Democrats are going to hate him regardless, and independents have long ago dismissed him as a fool. It costs the president nothing unless the Senate replaces McConnell with somebody worthy of the title so that Congress can get back to doing its job. Or unless McConnell somehow gets a better offer. Expecting him to find the shreds of his discarded integrity is laughably naive.

The shutdown will go on until one of these things happens, in order by increasing likelihood:

  1. Trump gives up.
  2. McConnell allows the Senate to debate a spending bill that does not fund Trump’s wall.
  3. The Senate kicks McConnell to the curb.
  4. The House passes a bill that partially funds Trump’s border wall.
  5. Trump, like he’s done so many times in the past with other things, conveniently forgets about the shutdown and finds some way to spin things for his supporters.

I consider the last two to be almost equally likely.

Whatever the case, Trump doesn’t win here. With the first three options, he’s revealed as a weak fool. With the fourth, he retains some dignity, but will have to swallow some pride because he didn’t get exactly what he wanted. The fifth option is a defeat and he loses some supporters, but his inability to admit defeat protects his fragile ego. To him, it’ll be as though the shutdown never happened.

Re-thinking the retry

A common pattern used when communicating with external services is to retry a call that fails. Stripped of bells and whistles, the typical retry loop looks like this:

    result = makeServiceCall(parameters)
    numRetries = 0
    while !result.success && numRetries < MAX_RETRIES
        numRetries = numRetries + 1
        // insert delay before retry
        result = makeServiceCall(parameters)
    end while

    if (!result.success)
        // Raise error.

We can quibble about the specific implementation, but the general idea is the same: make multiple attempts to get a success response from the service, inserting a delay between calls. The delay can be a fixed amount of time, an exponential backoff with jitter, etc., and you can include all kinds of other logic to improve things, but it all boils down to essentially the same thing.
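
Fleshed out in Python, one common variant looks something like this. The `ServiceError` type and the callable are stand-ins for your real client, not any particular library:

```python
import random
import time

MAX_RETRIES = 2
BASE_DELAY_SECONDS = 0.1

class ServiceError(Exception):
    """Stand-in for whatever your service client raises on failure."""

def call_with_retry(make_service_call, *args, **kwargs):
    """Call the service, retrying with exponential backoff and full jitter."""
    last_error = None
    for attempt in range(MAX_RETRIES + 1):
        try:
            return make_service_call(*args, **kwargs)
        except ServiceError as err:
            last_error = err
            if attempt < MAX_RETRIES:
                # Sleep a random amount between 0 and BASE * 2^attempt
                # ("full jitter"), so clients don't retry in lockstep.
                time.sleep(random.uniform(0, BASE_DELAY_SECONDS * 2 ** attempt))
    raise last_error
```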

Under normal circumstances, of course, service calls don’t fail, and this logic works very well. But if the service call failure rate increases, some very bad things can happen.

Imagine a critical service for which you have a contract (a service level agreement, or SLA) that guarantees an average 10 ms response time for up to 1,000 transactions per second (TPS). Remember, though, that this is a shared service. There are many other clients who have similar contracts: 10 ms response time for X TPS. Your application calls that service perhaps 900 times per second, on average. There will be brief periods when your call rate will exceed 1,000 TPS, but that’s okay because the service is scaled to handle large amounts of traffic from many clients. Say the service can guarantee that 10 ms response time for a total of 1,000,000 TPS from all of its clients, combined. Short-term bursts of excess traffic from a few clients aren’t a problem.

Even if calls to the service exceed 1,000,000 TPS, the likely result at first will be increased response time: perhaps latency increases by 50% with a sustained traffic increase of 10%, and doubles when traffic is 20% above the configured maximum. The specific breaking point differs for every service, but in general latency increases non-linearly with the call rate.

Clients, of course, won’t wait forever for a response. They typically configure a timeout (often two or three times the SLA), and consider the call a failure if it takes longer than that. Not a problem with this retry logic: just delay a bit and try again.

As I said above, this kind of thing works fine under normal conditions. But in a large system, lots of things can go wrong.

Imagine what would happen if the service starts getting 1,500,000 requests per second: a sustained 50% increase in traffic. Or one of the service’s dependencies can’t meet its SLA. Or network congestion increases the error rate. Whatever the cause, the service’s failure rate increases, or latency increases beyond the timeout value set by clients. Whatever the cause of the service’s distress, your application blindly responds by sending another request. So if your MAX_RETRIES value is two, then you’ve effectively tripled the number of calls you make to the service.

The last thing a service under distress needs is more requests. Even if your application is not experiencing increased traffic, your retries still have a negative effect on the service.
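
The amplification is easy to quantify. If we make the simplifying assumption that retries fail at the same rate as initial attempts, the expected calls per logical request work out like this:

```python
def load_multiplier(failure_rate, max_retries):
    """Expected service calls per logical request when every failed
    attempt is retried, assuming retries fail at the same rate."""
    expected_calls = 0.0
    p_reach_attempt = 1.0
    for _ in range(max_retries + 1):
        expected_calls += p_reach_attempt
        p_reach_attempt *= failure_rate
    return expected_calls

print(load_multiplier(0.01, 2))  # healthy service: ~1.01, retries are noise
print(load_multiplier(1.0, 2))   # total outage: 3.0, triple the load
```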

Some argue that services should protect themselves from such traffic storms. And many do. But that protection isn’t free. There comes a point when the service is spending so much time telling clients to go away that it can’t spend any time clearing its queue. Not that clearing the queue is much help. Even after the initial problem is fixed, the service is swamped with requests from all those clients who keep blindly retrying. It’s a positive feedback loop that won’t clear until the clients stop calling.

The retry loop above might improve your application’s reliability in normal operation. I say “might” because most applications I’ve encountered don’t actually track the number of retries, so they have no idea if the retry logic even works. I’ve seen the following in production code:

  1. A retry loop that always made the maximum number of retries, even if the initial call succeeded.
  2. Retry logic that never retried. That code was in production for two years before anybody realized there was a problem. Why? Because the service had never failed before.
  3. Retry logic that actually did the retries but then returned the result of the first call.
  4. Infinite retry. When a non-critical service went down one day, the entire site became inoperable.

As bad as it is that many programmers apparently don’t test their retry logic, even fewer monitor it. In all the applications I’ve seen with retry logic, only a handful can tell me how effective it is. If you want to know whether your retry logic is working, you have to log:

  • The number of initial calls to the service.
  • The number of initial call failures.
  • The total number of calls to the service (including retries).
  • The number of call successes (including success after retry).
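
A minimal way to capture those four counters, sketched in Python (the counter and function names are mine, not from any monitoring library):

```python
from dataclasses import dataclass

@dataclass
class RetryMetrics:
    initial_calls: int = 0     # logical requests made
    initial_failures: int = 0  # first attempt failed
    total_calls: int = 0       # attempts, including retries
    successes: int = 0         # succeeded, possibly after retry

def call_with_metrics(make_service_call, metrics, max_retries=2):
    """Wrap a boolean-returning service call and record the counters
    needed to judge whether retrying ever actually helps."""
    metrics.initial_calls += 1
    for attempt in range(max_retries + 1):
        metrics.total_calls += 1
        if make_service_call():
            metrics.successes += 1
            return True
        if attempt == 0:
            metrics.initial_failures += 1
    return False
```

From these you can compute, for example, how many initial failures were rescued by retry: successes minus first-try successes (initial_calls − initial_failures), divided by initial_failures.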

From those numbers, you can determine the effectiveness of the retry logic. In my experience, the percentage of initial call failures to any service under normal operation is less than 1%, and retry succeeds in fewer than 50% of those cases. When a service is under distress and the initial failure percentage gets above about 10%, retry is almost never successful. The reason, I think, is that whatever condition caused the outage hasn’t cleared before the last retry: service outages last longer than clients are willing to wait.

For the majority of applications I’ve encountered, retry is rarely worth the effort it takes to design, implement, debug, test, and monitor. Under normal circumstances it’s almost irrelevant, maybe making the difference between 99% and 99.5% success rate. In unusual circumstances, it increases the load on the underlying service, and almost never results in a successful call. It’s a small win where it doesn’t matter, and a huge liability when it does matter.

If you have existing retry logic in your code, I encourage you to monitor its effectiveness. If, like me, you discover that it rarely provides benefit, I suggest you remove it.

If you’re considering adding retry logic to your code, be sure to consider the potential consequences. And add the monitoring up front.

Setting JVM memory limits

Java has an option, -XX:MaxRAMFraction=XXX, that lets developers specify the maximum amount of available RAM a Java application should occupy. This sounds like a great idea, until you realize how it’s implemented. The number that you specify for “XXX” is an integer that represents the fraction of total RAM that should be used. The interpretation is 1/XXX. So -XX:MaxRAMFraction=1 tells the application to use as much RAM as it wants. -XX:MaxRAMFraction=2 tells it to use half of the memory. What if you want to use 75% of the memory? Or, perhaps, 2/3 of the memory? Tough luck. You get all, one-half, one-third, one-fourth, one-fifth, etc. You want to use 40%? Sorry, buddy. You can get 50% or 33%. Take your pick.

When I first heard this, I thought the guy who was explaining it was misinformed or pulling my leg.

I’d sure like to understand the thinking that went into this design decision, because from where I’m standing it looks like lunacy. I can’t imagine how such an idiotic design could make it through review and implementation without somebody with a modicum of intelligence and authority raising a stink. You mean nobody realized how stupid this is?

Java 10 introduced the -XX:MaxRAMPercentage flag and its relatives, which let you specify the percentage directly (for example, -XX:MaxRAMPercentage=75.0).

Sanity prevails, but I still want to understand how the original design got approved. That something so obviously wrong managed to make it through the approval process and was released on an unsuspecting world doesn’t give me much confidence in the competence of Java platform developers.

Treasure from Trash

(The turtle underneath the table, by the way, was something I found in my brother’s things. I think it’s a piece that my dad brought home from his trip to Central America back in the early ’70s.)

Two years ago, I found a rotting tree stump in the campgrounds at the Sherwood Forest Faire. I knew when I first saw it that I wanted to make a coffee table base, so I enlisted a friend to help with loading it on the trailer. Below is what the piece looked like when I found it.

I’ll admit that I wasn’t sure exactly how I would turn that into a coffee table base, or if there was enough good wood underneath all that rot, but I thought it was worth a shot.

Once I got it home, a few hours with a chainsaw, my trusty angle grinder, chisels, and a wire brush, and I was sure I had something worth keeping. Here’s what it looked like once I’d stripped the bark and got down to the solid wood underneath most of the rot.

It took another day of finer work with chisels and assorted other implements of destruction before I had it in the basic shape I wanted.

Most of the work I did to this point was just removing rot, and cutting off protuberances to get the basic shape that I wanted. That’s the fun part: peeling off the outer layers to reveal the beauty underneath. Nature provides the shape; I just refine it a bit.

I like the rough work mostly because I can see quick progress. Once I get to this point, I know what it’s going to look like. I’ve solved the problem, proven that I can realize my vision. After this point comes the slow work of making sure that the base is steady, leveling the top, smoothing the shape, sanding, and finishing. It’s work, and painstakingly slow. And sanding a piece like this takes a very long time. As with many things, the hard part is finishing the work after I know how it’s going to end.

So I put the piece aside for a year and concentrated on other projects. I came back to it in the fall of 2017, finished the sanding, and put a finish on it. Then it sat in the garage until last month, when I pulled it out, measured it, and ordered the glass.

My intention, once I completed the rough work, was to finish the piece and sell it. But Debra, who was a bit skeptical when I dragged this thing home, saw the potential and said that she wanted the coffee table for the living room. One of these days, I’ll make a piece that I can sell.

I now understand why pieces like this one command such high prices. The wood is free, but the amount of work and the cost of custom glass work make these things very expensive to produce. But it’s a one-of-a-kind piece of functional art that will last several lifetimes.

I have several more such pieces in various stages of completion. I really need to finish building out my workshop so I can complete them.

Use your data types!

Imagine there’s a program that makes API calls through a proxy to another service. One of the parameters you pass to that proxy is a timeout value: the maximum time that the proxy should wait before giving up and returning with a timeout error. The function call would be something like:

    result = callProxy("SomeService", "ServiceFunction", {parameters}, timeout);

The timeout value is defined in code as:

    double timeout = 10;

And that code works for months or years until some programmer is in there editing things one day and says to himself, “A timeout of 10 milliseconds? That’s crazy. We should give the service at least a full second to respond.” He promptly changes the timeout value to 1000 and goes on his merry way. All the unit tests pass, the code is pushed, and continues to work for more months or years.

Until it doesn’t.

One fateful morning at about 02:00, you’re shocked out of bed by your pager screaming that there’s a problem. Bleary-eyed, you stumble to the computer, log in, and find that your page has stopped serving requests. After a long search, you discover that “SomeService” is experiencing high CPU usage, but you can’t figure out why that would make your program stop serving requests. After all, your program should be getting a timeout error after a second, and the error handling code looks like it works.

At some point you discover the problem: timeout is actually expressed in seconds, not milliseconds. So rather than waiting 10 seconds for a timeout, it’s waiting 16 minutes and 40 seconds.

Yes, something like this actually has happened. It wasn’t pretty.

Granted, quite a few things went wrong here, among them:

  • There was no comment describing the units that the timeout represents.
  • The timeout variable should have been named timeoutInSeconds.
  • The programmer should have looked more closely at the code before changing it.
  • The code reviewer should have caught the error.
  • The automated tests should have simulated a service timeout.

The error should have been caught. But it wasn’t.

A much more effective way to avoid an error like that is to use descriptive data types. If you’re expressing a time span, then don’t use a unit-less value like an integer or a floating point value. Use a data type intended for expressing time spans. In C#, for example, you’d use the TimeSpan, like this:

    TimeSpan timeout = TimeSpan.FromSeconds(10);

It would be very difficult for a programmer to look at that and think, “10 milliseconds isn’t long enough.” And there’s absolutely no need for a comment here because the code says exactly what it does: it creates a timeout value of 10 seconds. It’s nearly impossible to screw this up. If on the off chance some drunken coder did screw that up, the person doing the code review almost certainly wouldn’t.
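
Most languages have an equivalent. In Python, for instance, datetime.timedelta plays the same role:

```python
from datetime import timedelta

timeout = timedelta(seconds=10)

# The units travel with the value. If a lower-level API insists on a
# bare number, convert explicitly at the boundary:
timeout_in_ms = timeout / timedelta(milliseconds=1)  # 10000.0
print(timeout, timeout_in_ms)
```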

These types of errors happen more often than you might like to think. Many of us don’t see these because we work in small companies on relatively new code, where the current employees are also the people who wrote the original code. Things are different when you work at a large company where people move around a lot, and the code you’re working on has been there for 10 years and the programmers who wrote it are off doing other things, or have left the company.

Do yourself, and the people who will maintain your code in the future, a favor. Use descriptive data types just like you use descriptive variable names. Unit-less values are errors waiting to happen. And I guarantee you don’t want to be the person who made this type of error when that error happens at 2 AM on a large retail web site, where every minute of down time means millions of dollars in lost sales. Nor do you want to be the person who has to track down such an error in those circumstances.

Jerome J. Mischel, Jr (Jay): April 12, 1960 – January 31, 2018

My brother Jay passed away this afternoon, after a year-long fight with sarcoma, a rare and aggressive form of cancer. He was 57 years old.

Jay was the second of five children, and the oldest boy; 18 months older than I. As children, we were pretty much inseparable. At least that’s how I remember it. My fondest memories of early childhood include him. We began to drift apart when I was nine or ten, about the time he started going to junior high school. Although he did look out for me when we went on Boy Scout camping trips together. I went off to military school a few years later, and we didn’t see much of each other after that. We worked together with my dad for a few years in the late 1980s and early 1990s, but went our separate ways again after Dad died.

From his early teens on, Jay was something of a rebel. He had his own idea of what made a happy and successful life, and it was decidedly different from what Dad had in mind. Early on, he tried to fit in to conventional expectations. After graduating from St. John’s Military School, he worked for a few years and then enlisted in the U.S. Marine Corps, where he served for four years. Following his stint in the Corps, he went to work with Dad, building and operating small motels in Colorado, Arizona, and New Mexico. But his heart really wasn’t in it. Jay’s passion was music, and when Dad died he began to follow that passion.

Jay lived for 25 years in the Branson, MO and Eureka Springs, AR areas, working sound and lights for shows, and working at various other businesses. He struggled financially at times, but he enjoyed his life. Jay was a good man who kept mostly to himself, but his friends could count on him for help whenever they needed it. He was a true fan of music, a talented lights and sound man, and a singer/songwriter. We kept in touch, but weren’t especially close. More my loss than his, I think.

Jay struggled with medical issues: a heart attack and bypass surgery when he was 43, gallbladder surgery a few years later, and a motorcycle accident that fractured his pelvis and permanently damaged one foot. A year ago he began having trouble with his leg, and in May the doctors discovered the sarcoma. Two surgeries, chemotherapy, and radiation were unsuccessful in eliminating the cancer, and in December it became clear that he had little time left. Several of us visited him at Christmas as well as last weekend. Mom was there for over a month. Our sister was with him at the end.

Jay asked that there be a private family service, and that if you want to remember him, make a donation to your favorite charity, “or just be good to each other.”

His last words to me, as I was leaving Tuesday evening, were “See you on the flip side, bro.” Ditto, my friend. We miss you.

Responding to errors at Amazon

Saying that Amazon’s servers handle a lot of traffic is an understatement. Publicly available information estimates something like 180 million unique visitors per month in the US. Amazon reported over seven million orders on Black Friday 2017. The amount of traffic we process is just staggering, and the complexity of the underlying infrastructure is mind boggling.

When errors occur, and they do distressingly often, we follow a four-step process:

  1. Identification. Figure out what the heck is going on. Understand it well enough so that you can move on to:
  2. Mitigation. Stabilize the service so that the error no longer occurs. This might take the form of increasing capacity, rolling back a recent change, making an emergency patch, throttling a noisy client service, etc.
  3. Correction. Modify code or processes to prevent that specific error from occurring again.
  4. Understanding. Study the event more deeply to understand how our processes allowed that error to occur, and how we can improve our processes to prevent similar errors in the future. The output from this step is a Correction of Errors document, or COE.

The first three steps are pretty standard fare: things you would expect any team that is running mission-critical services to do. The last, though, is surprisingly rare in the industry. Some companies say that they do postmortem investigations and such, but often it’s just lip service. More ominously, many times that investigation is an exercise in assigning blame for the error as a prerequisite to disciplinary action. Nobody wants to make a mistake because somebody will end up being blamed, and possibly fired. Error investigation becomes an exercise in CYA.

The COE process at Amazon is different. The purpose really is to dive deep so that we fully understand how we made the error, and how we can prevent making the same type of error in the future. The guidelines for writing a COE specifically discourage identifying individuals by name. The format is well specified. It seeks to answer the following questions:

  1. What happened?
  2. How did it affect customers?
  3. How did you identify the error?
  4. How did you fix the error, once identified?
  5. Why did the error occur?
  6. What did you learn from this incident?
  7. What will you do to prevent this from happening again?

These are hard questions to answer. The “why” section is especially difficult because you have to keep asking “why” until you get to the root cause of the problem. It’s much harder than it seems at first look.

This process of investigation to find the root cause has spawned a new verb phrase: “to root cause.” As in “I need to root cause that problem.” That drives me bonkers. I refuse to use that particular jargon, and I’ve been known to express my displeasure at its use. I fear, though, that I’m fighting a battle that’s already been lost.

COE reviews can be uncomfortable because the reviewers are very sharp and adept at asking probing questions. A review can be especially uncomfortable if the reviewers detect that the investigation wasn’t thorough, or that essential information was omitted. Trying to hide mistakes, or trying to deflect responsibility, can get one in trouble. On the other hand, a thorough investigation followed by a COE that shows a deep understanding of the error and meaningful action items to prevent similar occurrences can reflect very positively on the person or people involved.

Writing a COE is, fundamentally, an exercise in taking ownership of one’s mistakes, learning from them, and passing that knowledge on to others.

Amazon understands that errors happen. One of our Leadership Principles is “Bias for Action,” which advocates calculated risk taking. Nevertheless, errors can be costly. The COE process is an attempt to gain something positive from those occurrences so that we can avoid having to pay for them again. The COE document is an opportunity for the people involved to demonstrate other Leadership Principles: ownership, their ability to dive deep and understand things at a fundamental level, and their insistence on the highest standards.

In a very real sense, making a mistake can be a positive thing, for the company and for your career. I’m not saying that you should strive to make mistakes (after all, another Leadership Principle is that leaders “Are right, a lot”), but responding positively to the occasional mistake can be a Good Thing.

Why I hate Outlook

Yesterday at work a co-worker sent me an email with a bulleted list of URLs that I would have to visit today to make some configuration changes. The URLs have the format <GUID>, where <GUID> is replaced with a 36-character string like 85d8a9de-9503-457d-8815-18f277847f43. So, for example:


Ever helpful, Outlook tried to pretty this up for me (or perhaps for my coworker). In any case, it helpfully added hrefs to the links, but rather stupidly. It decided that only the part up to but not including the first hyphen was the actual URL. So I got this:

That in itself is no problem. I’ve seen automatic formatting get confused like that before. Usually I can just copy and paste the text, and everything works out. Not this time!

The second thing Outlook did was much more insidious. It replaced some of the hyphens with en dash characters. So when I copied and pasted the URL to my browser, I got a 404 “Not Found” error.

This type of error is nearly impossible to spot by eye. Now that I know what to look for, I can see in the URL below that some of the “hyphens” are slightly longer than others. But it’s not something you’d spot from a quick inspection. Especially not when the URLs are displayed in a variable-width font in a bulleted list.


It took me a few minutes of trial and error before I figured out what was going on here, then a few more minutes of cussing at Outlook before I just copied the list to my trusty plain text editor and replaced the offending characters.
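The cleanup my text editor did can be reduced to a one-line character replacement. Here’s a hypothetical Java sketch of the idea; the class and method names are mine, and the characters replaced are the usual suspects (en dash U+2013, em dash U+2014, non-breaking hyphen U+2011):

```java
// Sketch: normalize dash-like characters in a pasted URL back to ASCII hyphens.
public class UrlDashFixer {
    public static String fixDashes(String url) {
        return url.replace('\u2013', '-')   // en dash
                  .replace('\u2014', '-')   // em dash
                  .replace('\u2011', '-');  // non-breaking hyphen
    }

    public static void main(String[] args) {
        // A GUID-style segment where two hyphens were swapped for en dashes.
        String pasted = "85d8a9de\u20139503-457d\u20138815-18f277847f43";
        System.out.println(fixDashes(pasted));
    }
}
```

This only covers the dash substitution; Outlook’s broken auto-linking still has to be ignored by copying the raw text.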

I really detest software that “helps” me like this.

I know, “Don’t use Outlook.” That’s not an option. When you work for a company that has a half million employees on Outlook, you use it, too. Or you suffer from poor productivity for a while before you find yourself out on the street looking for another place to work.

Gold Bitcoin Beanie Baby Bulbs

So, yeah, I’m not the first person to point out the parallels between the recent Bitcoin frenzy and the Dutch tulip mania of the 1630s. Nor, I suspect, am I the first to mention that Bitcoin’s meteoric rise bears a shocking resemblance to:

I wasn’t around for the first of those, but I saw the others happen. I even lost a large part of my meager savings in the 1980 gold frenzy. Every one of these events saw people betting their futures on a “sure thing” that “can’t lose.” They were putting their entire life savings into it, borrowing heavily to gamble on a speculative market that seemed like it would just keep going up. And in every case, the bubble burst, wiping out savings in a very short period.

Those bubbles burst because investors flocking to the “can’t lose” scheme drove the prices to levels that were unsustainable. Early investors get in, ride the rise for a while, and then sell to new investors who want the same kind of trip. It becomes a positive feedback loop, until the price becomes too high to attract new investors. One day, somebody wants to get off and discovers that nobody wants to pay what he’s asking for his position. He decides to sell at a loss, at which point everybody else starts to panic and starts unloading as fast as they can, hoping to get out with something.

I don’t know enough about Bitcoin and other crypto currencies to say what, if anything, they’re actually worth, or if the idea of crypto currency has any long-term merit. But the meteoric increase in Bitcoin prices over the last year, from $750 to $10,000, brings to mind those parallels, and a little more research reveals all the signs of a speculative bubble. The number of companies specializing in crypto currency trading has grown tremendously over the past year. There are “network marketing” schemes that pay you for “helping” others get in on the deal. New crypto currencies are popping up. People are borrowing money to invest. People are posting cheerleader messages (“Rah, rah Bitcoin!”) on social media. I’m seeing more hockey stick charts every day. “Look what it’s done in just the last three months!”

There may indeed be some lasting value to Bitcoin and other crypto currencies, just as there was lasting value in Beanie Babies. I saw some at a yard sale last week, going for about 50 cents each.

Birthdays, random numbers, and hash keys

This is a re-post of an article I wrote for my .NET Reference Guide column some years ago.

You’re probably familiar with the Birthday problem, either from your introductory statistics class or from hanging around computer programmers for too long. Briefly, the math shows that the odds of any two people in a group having the same month and day of birth are quite a bit higher than you would intuitively expect. How high? Given a group of just 23 people, the odds are better than 50% that two of those people will share the same birthday. With 50 people in a group, the odds of two sharing the same birthday are above 97%, and by the time you have 60 people, it’s nearly a sure thing.
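Those percentages can be checked directly: the chance that n people all have distinct birthdays is the product 365/365 × 364/365 × 363/365 × …, and the collision probability is one minus that. A short Java sketch (class and method names are mine) confirms the numbers:

```java
// Sketch: exact birthday-collision probability for a group of n people,
// computed by multiplying the "no collision yet" terms.
public class BirthdayMath {
    public static double collisionProbability(int n) {
        double pNoCollision = 1.0;
        for (int i = 0; i < n; i++) {
            pNoCollision *= (365.0 - i) / 365.0;  // i-th person avoids all earlier birthdays
        }
        return 1.0 - pNoCollision;
    }

    public static void main(String[] args) {
        System.out.printf("23 people: %.4f%n", collisionProbability(23)); // just over 0.5
        System.out.printf("50 people: %.4f%n", collisionProbability(50)); // about 0.97
        System.out.printf("60 people: %.4f%n", collisionProbability(60)); // nearly certain
    }
}
```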

This falls into the realm of what I call “Math that just don’t feel right.” Nevertheless, it’s true. We can demonstrate it with a simple program that selects random numbers in the range 0 to 364 until it picks a duplicate.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Program
    {
        const int NumKeys = 365;
        const int Thousand = 1000;
        const int Million = Thousand * Thousand;
        const int NumTries = 10 * Million;

        static void Main(string[] args)
        {
            // List contains number of items to collision for each iteration.
            List<int> Totals = new List<int>();
            // Do the test NumTries times, saving the result from each attempt.
            for (int i = 0; i < NumTries; ++i)
            {
                int n = DoTest(NumKeys);
                Totals.Add(n);
                if ((i % 100000) == 0)
                    Console.WriteLine("{0,5:N0} - {1:N0}", i, n);
            }
            // Group the items by their result.
            var grouped = from cnt in Totals
                          group cnt by cnt into newGroup
                          orderby newGroup.Key
                          select newGroup;
            // Output a table of results and counts, with running percentage.
            int runningTotal = 0;
            foreach (var grp in grouped)
            {
                runningTotal += grp.Count();
                Console.WriteLine("{0,3:N0}: {1,7:N0} ({2,8:P2})  ({3,8:P6})", grp.Key, grp.Count(),
                    (double)grp.Count() / NumTries, (double)runningTotal / NumTries);
            }
            Console.WriteLine("Min items to collision = {0:N0}", Totals.Min());
            Console.WriteLine("Max items to collision = {0:N0}", Totals.Max());
            Console.WriteLine("Average items to collision = {0:N0}", Totals.Average());
        }

        // Random number generator used for all attempts.
        static Random rnd = new Random();

        static int DoTest(int nkey)
        {
            HashSet<int> keys = new HashSet<int>();
            // Pick random numbers in the range [0..nkey-1]
            // until a duplicate is selected.
            int key;
            do
            {
                key = rnd.Next(nkey);
            } while (keys.Add(key));
            return keys.Count;
        }
    }

The program isn’t as complicated as it looks. I structured it so that you can run the test many times and then aggregate the results. In this particular case, I run 10 million iterations.

The result of each iteration (the count of random numbers selected before a duplicate was found) is saved in the Totals list. When all iterations are done, the code groups the results so that we know how many times each result occurred. Finally, the grouped results are output with a running percentage. Here’s the output from a sample run:

    1:  27,322 (  0.27 %)  (0.273220 %)
    2:  54,962 (  0.55 %)  (0.822840 %)
    3:  81,735 (  0.82 %)  (1.640190 %)
    4: 107,442 (  1.07 %)  (2.714610 %)
    5: 132,999 (  1.33 %)  (4.044600 %)
    6: 157,841 (  1.58 %)  (5.623010 %)
    7: 181,920 (  1.82 %)  (7.442210 %)
    8: 203,602 (  2.04 %)  (9.478230 %)
    9: 223,787 (  2.24 %)  (11.716100 %)
    10: 241,462 (  2.41 %)  (14.130720 %)
    11: 258,365 (  2.58 %)  (16.714370 %)
    12: 274,581 (  2.75 %)  (19.460180 %)
    13: 286,809 (  2.87 %)  (22.328270 %)
    14: 298,008 (  2.98 %)  (25.308350 %)
    15: 306,072 (  3.06 %)  (28.369070 %)
    16: 314,766 (  3.15 %)  (31.516730 %)
    17: 318,592 (  3.19 %)  (34.702650 %)
    18: 323,361 (  3.23 %)  (37.936260 %)
    19: 323,649 (  3.24 %)  (41.172750 %)
    20: 322,924 (  3.23 %)  (44.401990 %)
    21: 319,847 (  3.20 %)  (47.600460 %)
    22: 316,430 (  3.16 %)  (50.764760 %)
    23: 310,096 (  3.10 %)  (53.865720 %)
    24: 303,122 (  3.03 %)  (56.896940 %)
    25: 295,341 (  2.95 %)  (59.850350 %)
    26: 285,219 (  2.85 %)  (62.702540 %)
    27: 276,433 (  2.76 %)  (65.466870 %)
    28: 265,044 (  2.65 %)  (68.117310 %)
    29: 253,646 (  2.54 %)  (70.653770 %)
    30: 241,188 (  2.41 %)  (73.065650 %)
    31: 228,559 (  2.29 %)  (75.351240 %)
    32: 216,584 (  2.17 %)  (77.517080 %)
    33: 203,608 (  2.04 %)  (79.553160 %)
    34: 190,222 (  1.90 %)  (81.455380 %)
    35: 177,870 (  1.78 %)  (83.234080 %)
    36: 165,091 (  1.65 %)  (84.884990 %)
    37: 153,658 (  1.54 %)  (86.421570 %)
    38: 141,501 (  1.42 %)  (87.836580 %)
    39: 129,984 (  1.30 %)  (89.136420 %)
    40: 118,849 (  1.19 %)  (90.324910 %)
    41: 108,598 (  1.09 %)  (91.410890 %)
    42:  98,732 (  0.99 %)  (92.398210 %)
    43:  89,560 (  0.90 %)  (93.293810 %)
    44:  80,692 (  0.81 %)  (94.100730 %)
    45:  72,640 (  0.73 %)  (94.827130 %)
    46:  65,559 (  0.66 %)  (95.482720 %)
    47:  58,525 (  0.59 %)  (96.067970 %)
    48:  51,877 (  0.52 %)  (96.586740 %)
    49:  46,329 (  0.46 %)  (97.050030 %)
    50:  40,333 (  0.40 %)  (97.453360 %)
    51:  35,917 (  0.36 %)  (97.812530 %)
    52:  31,270 (  0.31 %)  (98.125230 %)
    53:  27,409 (  0.27 %)  (98.399320 %)
    54:  23,349 (  0.23 %)  (98.632810 %)
    55:  20,482 (  0.20 %)  (98.837630 %)
    56:  17,441 (  0.17 %)  (99.012040 %)
    57:  15,390 (  0.15 %)  (99.165940 %)
    58:  13,378 (  0.13 %)  (99.299720 %)
    59:  11,302 (  0.11 %)  (99.412740 %)
    60:   9,638 (  0.10 %)  (99.509120 %)
    61:   8,199 (  0.08 %)  (99.591110 %)
    62:   6,851 (  0.07 %)  (99.659620 %)
    63:   5,876 (  0.06 %)  (99.718380 %)
    64:   4,843 (  0.05 %)  (99.766810 %)
    65:   4,077 (  0.04 %)  (99.807580 %)
    66:   3,433 (  0.03 %)  (99.841910 %)
    67:   3,098 (  0.03 %)  (99.872890 %)
    68:   2,388 (  0.02 %)  (99.896770 %)
    69:   1,903 (  0.02 %)  (99.915800 %)
    70:   1,541 (  0.02 %)  (99.931210 %)
    71:   1,308 (  0.01 %)  (99.944290 %)
    72:   1,210 (  0.01 %)  (99.956390 %)
    73:     893 (  0.01 %)  (99.965320 %)
    74:     666 (  0.01 %)  (99.971980 %)
    75:     588 (  0.01 %)  (99.977860 %)
    76:     463 (  0.00 %)  (99.982490 %)
    77:     369 (  0.00 %)  (99.986180 %)
    78:     290 (  0.00 %)  (99.989080 %)
    79:     235 (  0.00 %)  (99.991430 %)
    80:     185 (  0.00 %)  (99.993280 %)
    81:     161 (  0.00 %)  (99.994890 %)
    82:     102 (  0.00 %)  (99.995910 %)
    83:     103 (  0.00 %)  (99.996940 %)
    84:      59 (  0.00 %)  (99.997530 %)
    85:      55 (  0.00 %)  (99.998080 %)
    86:      48 (  0.00 %)  (99.998560 %)
    87:      40 (  0.00 %)  (99.998960 %)
    88:      29 (  0.00 %)  (99.999250 %)
    89:      16 (  0.00 %)  (99.999410 %)
    90:      15 (  0.00 %)  (99.999560 %)
    91:      11 (  0.00 %)  (99.999670 %)
    92:       8 (  0.00 %)  (99.999750 %)
    93:       4 (  0.00 %)  (99.999790 %)
    94:       3 (  0.00 %)  (99.999820 %)
    95:       7 (  0.00 %)  (99.999890 %)
    96:       3 (  0.00 %)  (99.999920 %)
    97:       2 (  0.00 %)  (99.999940 %)
    98:       1 (  0.00 %)  (99.999950 %)
    99:       1 (  0.00 %)  (99.999960 %)
    100:       1 (  0.00 %)  (99.999970 %)
    101:       2 (  0.00 %)  (99.999990 %)
    102:       1 (  0.00 %)  (100.000000 %)
    Min items to collision = 1
    Max items to collision = 102
    Average items to collision = 24

This result agrees with the math. It shows some other interesting things, too.

Look at how many times the second number selected was a duplicate of the first: 0.27%. And in more than one percent of the iterations, the fifth number selected was a duplicate! All told, the odds of finding a duplicate are about 10% within the first nine numbers selected.

You can argue that birthdays aren’t exactly random, and you’d be right. Some days have more births than others. But those real-world differences shift the birthday problem’s numbers by only a few points. Maybe the break-even point is really 25 rather than 23.

You can also argue that the numbers returned by System.Random aren’t truly random. Again, you’d be right. You could code up a better random number generator that uses the cryptographic services to generate numbers that are more nearly random, but you’ll find that it won’t have much of an effect on the results.
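If you want to try it, here’s what that swap might look like. This is a sketch in Java (the original test program is C#; class and method names are mine) using SecureRandom, which draws from a cryptographically strong source:

```java
import java.security.SecureRandom;
import java.util.HashSet;

// Sketch: the same items-until-collision test, but drawing from a
// cryptographically strong generator instead of the default PRNG.
public class CryptoBirthdayTest {
    private static final SecureRandom rnd = new SecureRandom();

    public static int doTest(int nkeys) {
        HashSet<Integer> keys = new HashSet<>();
        int key;
        do {
            key = rnd.nextInt(nkeys);
        } while (keys.add(key));   // add() returns false on the first duplicate
        return keys.size();
    }

    public static void main(String[] args) {
        long total = 0;
        int tries = 100_000;
        for (int i = 0; i < tries; i++) {
            total += doTest(365);
        }
        // The average still lands near 24, same as with the ordinary PRNG.
        System.out.println("Average items to collision = " + (double) total / tries);
    }
}
```

The point is that the birthday math doesn’t care how strong the randomness is, only that the distribution is roughly uniform.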

So what do birthdays have to do with hash keys?

Let’s say you’re writing a program that has to keep track of a few million strings and you want a quick way to index them and eliminate duplicates. A fast and easy solution is to use hash codes as keys, right? After all, there are four billion different 32-bit hash codes, and you’re just going to use a few million.

You can probably see where this is going. What’s the likelihood that you’ll run into a duplicate? That’s right, the birthday problem raises its ugly head here. Let’s look at the math and see what it tells us.

What is the probability of getting a collision when selecting one million items out of a pool of four billion (actually, 2^32, or 4,294,967,296)?

According to the Wikipedia article, a coarse approximation, which is good enough for our purposes, is:

    p = 1 - e^(-n^2/(2 * 2^32))

We can wrap that up in a neat little method:

    static double DuplicateProb(long fieldSize, long nItems)
    {
        double minusNSquared = -(nItems * nItems);
        return 1 - Math.Exp(minusNSquared / (2 * fieldSize));
    }

A few tests will tell us if it’s working:

    DuplicateProb(365, 10) = 0.128017828988458
    DuplicateProb(365, 20) = 0.421863457482716
    DuplicateProb(365, 23) = 0.515509538061517
    DuplicateProb(365, 30) = 0.708547055710068
    DuplicateProb(365, 50) = 0.967439570087906
    DuplicateProb(365, 57) = 0.988329429309313
    DuplicateProb(365, 100) = 0.999998876014983

Those numbers don’t agree exactly with the calculated probabilities in the Wikipedia article, but they’re close: within the range of what you’d expect for a coarse approximation.

So, then, about our one million items selected from a field of four billion:

    DuplicateProb(4294967296, 1000000) = 1


How can that be possible? The probability of getting a duplicate within the first million items is so close to certainty that the computer calls it 1.0? This has to be a mistake, right?

It happens to be true. With n = 1,000,000 and a field size of 2^32, the exponent works out to about -116, and e^-116 is so small that 1 minus it rounds to exactly 1.0 in double precision. Unfortunately, it’s a bit difficult to prove with the birthday test program because Random.Next() only has a range of 0..2^31. I could modify the program to extend its range, but that’s really not necessary. Instead, I’ve run the program for increasing powers of two and present the results here:

    Field Size      Min        Max    Average
    2^16              4      1,081        316
    2^20             21      4,390      1,281
    2^30            342    142,952     41,498
    2^31          1,036    193,529     57,606

These are the results of 10,000 runs each. Doing 10 million runs of the 2^16 test took quite a while, and the numbers weren’t much different. The only number that changed significantly was the Max value: the maximum number of items selected before a duplicate was found.

The table is more interesting when you replace the Min, Max, and Average values with their corresponding powers of two:

    Field Size      Min        Max    Average
    16             2.00      10.08       8.30
    20             4.39      12.10      10.32
    30             8.42      17.13      15.34
    31            10.02      17.56      15.81

For every field size, the average number of items selected before a collision is approximately one half the power of two. Or, just a little more than the square root. That holds true for the birthday problem, too. The square root of 365 is about 19.1, and the 50% probability point is 23.

A really quick way to compute your 50% probability point would be to take the square root. That’s a bit aggressive, but it’ll put you in the ballpark.
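That rule of thumb comes straight from inverting the approximation above: solving p = 1 - e^(-n²/(2d)) for n gives n = sqrt(2d · ln(1/(1-p))), which for p = 0.5 is about 1.18·√d. A Java sketch (the function name is mine):

```java
// Sketch: invert the coarse approximation p = 1 - exp(-n^2 / (2d)) to find
// how many items give a target collision probability p in a field of size d.
public class FiftyPercentPoint {
    public static double itemsForProbability(double fieldSize, double p) {
        return Math.sqrt(2.0 * fieldSize * Math.log(1.0 / (1.0 - p)));
    }

    public static void main(String[] args) {
        System.out.println(itemsForProbability(365, 0.5));             // about 22.5
        System.out.println(itemsForProbability(Math.pow(2, 32), 0.5)); // about 77,000
        System.out.println(itemsForProbability(Math.pow(2, 32), 0.01)); // about 9,300
    }
}
```

For p = 0.5 the ln term is 0.693, so the multiplier on √d is sqrt(2 × 0.693) ≈ 1.18; just taking the square root undershoots a bit, which is why it’s “a bit aggressive.”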

More interesting is the Max value in the table. If you think in powers of two, then the Max value is approximately two bits more than the 50% value. That is, 2^30 is a billion items. The square root is 2^15, or 32,768. Two more bits is 2^17, or 128K (131,072). The Max value for 2^30, above, is 142,952: close enough to 2^17 for me.

Given the above, what do you think the numbers for 2^32 would be? The average will be around 83,000, and the Max will be close to 300,000.

That’s right, by selecting only 300,000 items from a field of four billion, you’re almost guaranteed to find a duplicate. You have a 1% probability of collision with about 9,300 items, and a 10% probability of collision with about 30,000 items selected.

It’s true that hash codes are not purely random, but a good hash function will give you very close to a uniform distribution. The problem is worse if the hash codes aren’t evenly distributed.

In case you haven’t got the point yet, don’t try to use a hash code as a unique identifier. The likelihood of getting a duplicate key is just too high.

I know what you’re thinking: what about making a 64-bit hash code? Before you get too excited about the idea, take a look at the Probability Table from that Wikipedia article. With a 64-bit hash code, you have one chance in a million of getting a duplicate when you’re indexing six million items. For one million items, it’d be about one chance in 37 million. That sounds pretty unlikely, but it really isn’t when you consider how often some programs run.

Granted, if you were only indexing a million items, you’d probably use something like a HashTable. It’s when you get into very large numbers of items that you begin worrying about the overhead imposed by your indexing method. Using a hash code sounds like a good option, until you discover the likelihood of generating a duplicate. Even with a 64-bit hash code, you have a 1% chance of generating a duplicate with only 610 million items.

Hash codes are great for quickly determining if two items can’t possibly match. That is, if two strings hash to different values, then they positively are not the same. However, hash codes are not a good solution for uniquely identifying items, because the probability of two different items hashing to the same value is surprisingly high.
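A concrete illustration of that last point: in Java, the strings "Aa" and "BB" are obviously different, yet String.hashCode maps both to 2112, because the hash is the polynomial 31·h + c and 65·31 + 97 = 66·31 + 66.

```java
// Sketch: two different strings with identical hash codes. A hash mismatch
// proves inequality; a hash match proves nothing.
public class HashCollisionDemo {
    public static void main(String[] args) {
        String a = "Aa";
        String b = "BB";
        System.out.println(a.hashCode()); // 2112
        System.out.println(b.hashCode()); // 2112
        System.out.println(a.equals(b));  // false: same hash, different strings
    }
}
```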

Live like our ancestors?

I’ve heard people say, during discussions of the evils of modern life, that we should endeavor to “live like our ancestors.” I wonder which of our ancestors they’re talking about.

I wonder if they think we should go back to my childhood, where we lived in constant fear of nuclear war, where there was a single phone in the house, calls out of your immediate area (long distance) were very expensive, and there was no such thing as overnight delivery or even fax machines. If you were lucky you got more than three channels on your television. Fresh fruit and vegetables were seasonal luxuries, and if you were lucky you got two days’ notice of potential hurricanes and floods. Food was more expensive and less plentiful. Airline travel was a very expensive luxury. If you wanted to go across the country it took you several days by train or perhaps a week by bus. And even a rumor of being homosexual could destroy your career and get you killed. A gay man was labeled a pervert, and wouldn’t find any public support. People didn’t acknowledge that gay women existed.

Or how about my parents’ generation, where children regularly died of measles, polio, and other infectious diseases that were mostly a thing of the past when I was growing up? People died of simple infections because modern antibiotics (even penicillin) didn’t exist. Life expectancy was about 65 years in the developed world. My dad recounted stories of trudging through the snow in the middle of the night to the outhouse. That sounds like a great time. Wouldn’t it be just peachy to live in a time when the law enforced discrimination against blacks? Air travel didn’t exist for most people. Family vacations were perhaps a trip to the lake in the summer. If you were lucky there was a telephone in the house. If you didn’t live in a city, you might not even have electricity. Refrigeration involved somebody delivering a block of ice to your house every week, so you could put it in the ice box along with the few bits of food you wanted to preserve.

My grandparents grew up on farms. No electricity, running water, telephones, or cars. Going into town involved hitching up the horse and wagon and spending the whole day getting to town and back. Radio didn’t even exist when my grandparents were kids. Children were born at home, without a physician in attendance. X-rays were unknown. Half the population of the United States spent more than half their time just growing and hunting enough food to live. Average life expectancy was about 50 years. In the early 1900s, the infant mortality rate in the U.S. was about 150 out of 1,000. By contrast, the infant mortality rate in Afghanistan today is about 112 out of 1,000. Today in the U.S. it’s less than 6 out of 1,000. Yeah, it’d be just grand to live in that time period.

Want to go back further? Until 1865 the law allowed people to own slaves. Yes, another person could be your personal property to do with as you pleased. Sounds fun, huh? Average life expectancy was about 40 years. If you lived to be 10, you were lucky to make it to 50. In 1800, close to 40% of children never lived to be five years old. Go back another 50 years or so and parents were lucky if half their children lived to adulthood. In Colonial America, 90% of people were subsistence farmers; they spent nearly all their time just growing and foraging for food. And it was grueling, backbreaking work.

You don’t want to live like your ancestors lived. You want to live in a bubble with all your modern conveniences and the benefits of a technological society, while the world outside conforms to some unrealistic notion you have of the idyllic life your ancestors lived in. You’re fooling yourself if you believe that they had it better than we do today.


How to confuse a programmer

Computer programming is complicated enough that we really don’t need to look for ways to further confuse programmers. On the contrary, we should actively look for ways to increase programmer understanding. Perhaps one of the easiest ways to confuse programmers is to choose bad names for things. For example, consider this little code snippet that’s supposed to demonstrate using a ManualResetEvent to control thread execution.

    private ManualResetEvent suspended = new ManualResetEvent(true);
    private volatile bool stop = false;

    public void Main()
    {
        // suspend
        suspended.Reset();

        // resume
        suspended.Set();
    }

    public void Work()
    {
        // long task
        while (!stop)
        {
            suspended.WaitOne(/* optional timeout */);

            // worker task
        }
    }
There’s nothing wrong with the way the snippet works, but naming the event suspended is one of the most confusing things I’ve seen this year. The suspended flag is set when the program is running. WTF? I’m pretty sure I’m not the only programmer who, if he were to encounter this in production code, would very likely misunderstand it.

“I have to wait for the suspended flag to be set before I can resume.”

Come on, people. Think about what you call things. Naming is important!

Take advantage of the early out

The “new thing” when I first started working with computers was Structured Programming. One of the tenets was that a function should have exactly one entry point and exactly one exit point. Some languages (Pascal, for example) enforced this by not having a return statement that would allow you to exit the function from somewhere in the middle.

In retrospect, that probably was a good thing, because people used to do all kinds of things like jump into the middle of a function, exit a function without returning a value, etc. Restricting functions to a single entry point and a single exit point was an effective way to prevent those stupid programmer tricks. But the rule is more restrictive than it needs to be.

I can’t think of any reason for a function to have multiple entry points, but restricting a function to a single exit point leads to things like this:

    public boolean isExternal(final URI url) {
        boolean target = false;
        if (url.isAbsolute()) {
            // this can be expensive and slow, run only for absolute URLs which should be an exception
            final String serverName = requestAttributesProvider.get().getServerName();
            final String host = url.getHost();
            if (!host.equals(serverName)) {
                target = true;
            }
        }
        return target;
    }

This code has an unnecessary intermediate variable, and an unnecessary level of nesting. All due to the restriction on the number of exit points. If we relax that restriction and allow an early out, the code becomes less cluttered:

    public boolean isExternal(final URI url) {
        // this can be expensive and slow, run only for absolute URLs which should be an exception.
        if (!url.isAbsolute()) {
            return false;
        }
        final String serverName = requestAttributesProvider.get().getServerName();
        final String host = url.getHost();
        return !host.equals(serverName);
    }

The idea here is to make a decision and then get out. There’s no reason to clutter the rest of the code with the “it’s not an absolute url” context. That allows us to eliminate the target variable, and reduces nesting in the function.

I find that code to be much more clear than the original.

A purist, by the way, would split that into two functions:

    public boolean isExternal(final URI url) {
        // this can be expensive and slow, run only for absolute URLs which should be an exception
        if (!url.isAbsolute()) {
            return false;
        }
        return urlHostIsExternal(url);
    }

    private boolean urlHostIsExternal(final URI url) {
        final String serverName = requestAttributesProvider.get().getServerName();
        final String host = url.getHost();
        return !host.equals(serverName);
    }

On a more general note, this is an anti-pattern:

    if (something) {
        return true;
    } else {
        // do other stuff
    }
    return whatever;

If you return from inside the if block, then the else is implied. There’s no need for it. You can reduce nesting and increase legibility by eliminating it:

    if (something) {
        return true;
    }
    // do other stuff
    return whatever;

I can’t think of any case in which such a transformation doesn’t make the code more readable, especially when working with cascading or nested if structures.

The same kind of thing applies to loops. Say you’re doing a sequential search for an item in a list. Structured Programming purists would have you write something like this:

    bool found = false;
    int ix = 0;
    while (ix < list.length && !found)
    {
        if (list[ix] == thingToFind)
            found = true;
        else
            ++ix;
    }
    if (found)
        return ix;
    else
        return -1;

We can eliminate a lot of clutter by returning directly from within the loop:

    int ix = 0;
    while (ix < list.length)
    {
        if (list[ix] == thingToFind)
            return ix;
        ++ix;
    }
    // here, we know that the item wasn't found, so just return the sentinel value.
    return -1;

Again, I can think of no case in which that transformation doesn’t improve readability.

Long-time readers probably wondered why some of the code examples above are in Java. Most of the code I’ll be working on in my new job is Java, so I figured it’d be a good idea to start using it. I’ll still be working with C# in my home projects, and reporting on them here, but there also will be Java from time to time.

How many feet of rope?

Elmer got a new job working for the procurement department of a very large company. His boss called him in on his second day and said that he needed a rope long enough to go all the way around the Earth at the equator. Elmer’s job was to figure out how long the rope had to be, and to get quotes from suppliers: X feet of rope at Y dollars per foot.

Assume for the sake of this example that the Earth is perfectly spherical and smooth.

Elmer went back to his desk, did some calculation to figure out how much rope he’d need, and started burning up the phone talking to suppliers. A few hours later, quotes in hand, he entered the boss’s office and placed the quotes on his desk. Boss glanced at them and then said, “Good work. Before we place the order, this is enough rope to stretch all the way around the Earth, leaving a foot of space between the surface and the rope, right?”

Elmer, surprised, said, “No, sir. You asked for a rope to go all the way around the Earth. You didn’t say anything about a foot of slack.”

Boss pushed the quotes across the desk towards Elmer: “Well, you’ll have to re-do these quotes. I need the rope a foot away from the surface.”

Elmer, being the smart guy he is, replied, “That’s okay, we’ll just add XX feet of rope to the original quote.”

What is the value of XX? How did Elmer figure that out so quickly?

My first games programming job was a huge learning experience. It was the first time I’d ever worked with people who were a lot smarter than I was. At least, smarter in terms of computer science. The people I’d worked with before understood specific things about programming, but they didn’t love programming like I did. To most of the people I’d worked with before, programming was just a job. I lived programming! I studied problems at home, solved problems at work, always had a few side projects, etc. Programming was my job and my most engaging hobby. Debra says that she shares me with my first wife: the computer.

One of the great things about working there was that we were always posing little brain teasers to each other. These varied from math questions like the one above, to algorithms for things like detecting whether a linked list has a loop, to larger design questions that don’t really have a “right” answer but are designed to make you think about how to approach a problem. In many ways it was like an ongoing job interview. The puzzle I presented above was one of them. It’s one that I got after just a few seconds’ thought, but then had to check my work on the white board.

I’ve posed this question in interviews over the years, when a candidate says he’s good with math. The reason is that it’s a relatively simple problem that somebody who’s “good at math” should be able to solve relatively quickly. While I’m interested in the solution, I’m more interested in how the candidate approaches the problem. Does he know how to set up the equation? Does he ask me questions to clarify the problem? Does he just sit there looking like he has no idea where to start?

Some people think the problem is too specialized, but I think it’s more than reasonable to expect a programmer, especially one who says he’s good at math, to know the equation for the circumference of a circle. This goes double for games programmers!

It’s surprising how many candidates utterly fail when it comes to this question. Some just sit there, dumbfounded. Others get up at the white board and start doodling, but can’t formulate the problem. It’s embarrassing.

The circumference of a circle is two times Pi, times the radius. That is:

C = 2πr

The original quote was for C feet of rope–enough rope to go around the Earth with radius r. The new rope has to be a foot away from the surface of the Earth. That radius is (r+1). So you have the original rope with length 2πr, and the new rope with length 2π(r+1). We want to know the difference. Showing my work, as my instructors always insisted:

C1 = 2πr
C2 = 2π(r+1)
C2-C1 = 2π(r+1) – 2πr
= 2πr + 2π – 2πr
= 2π

Elmer’s answer was 2π, or about 6.28 feet. It only takes about 6.28 feet to move that rope from being against the surface of the Earth to being a foot away.
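The result is easy to verify numerically. Here’s a quick sketch in Java; the Earth’s radius in feet is approximate, and the whole point is that it doesn’t matter:

```java
public class RopeAroundEarth {
    // Extra rope needed when the radius grows by one foot:
    // 2*pi*(r+1) - 2*pi*r, which collapses to 2*pi for any r.
    static double extraRope(double radiusFeet) {
        double onSurface = 2 * Math.PI * radiusFeet;
        double oneFootUp = 2 * Math.PI * (radiusFeet + 1);
        return oneFootUp - onSurface;
    }

    public static void main(String[] args) {
        // Roughly the Earth's equatorial radius in feet.
        System.out.println(extraRope(20_925_646.0)); // ~6.283
        // A circle of radius one foot gives the same answer.
        System.out.println(extraRope(1.0));          // ~6.283
    }
}
```

Both calls print approximately 2π, which is exactly the non-intuitive part of the puzzle: the answer is independent of the starting radius.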

I’ve had candidates reach that solution and say, “but that can’t be right,” and go back to check their work. I consider that a good thing; being willing to admit that one might have made a mistake is something I look for in the people I work with.

I’ll admit that this fits squarely in the category of “math that just don’t feel right,” but it is right.

I’ve had other candidates argue that the answer is just flat wrong. Even after I show them the math. That’s bad. Not only is arguing with the interviewer bad form, but unsupported statements like, “I know that can’t be the right answer” show an unwillingness to learn and an inability to form reasoned, coherent arguments.

Note that I have nothing against polite, informed, disagreement from a candidate. That shows backbone and an ability to support one’s argument.

It’s a fun interview question, and can tell you a lot about the candidate’s approach to problems, and his reaction to things that are non-intuitive. Will he accept the math, or will he try to argue that the math is somehow wrong?

Wikipedia: Trust, but verify

A recent question on Stack Overflow asked why Quicksort is called Quicksort, even though it sometimes exhibits O(n²) behavior whereas Merge sort and Heap sort are both O(n log n) in all cases. (For those unfamiliar with the terminology, he’s asking why it’s called “Quicksort,” when other algorithms are theoretically faster.) It’s a reasonable question for a beginner to ask.

One person commented that the answer to the question can be found in the Quicksort article on Wikipedia. But the person asking the question rejected that source because “My professor said Wikipedia is not an authentic site.”

I doubt that’s what the professor said, because a college professor telling students not to use Wikipedia would be, at best, irresponsible. I suspect that the professor said Wikipedia should not be cited as a primary source in formal writing, and the student took that to mean Wikipedia is not a reliable source of information. That opinion seems to be widespread. I know intelligent people who will not use Wikipedia. At all. For anything. I cannot understand that behavior.

In my experience, Wikipedia is a highly reliable source of information on fact-based topics (historical events, general science, algorithms and data structures, mathematics, astronomy, etc.), and a good introduction when it comes to political, religious, or other emotionally-charged topics. If I want to know what something is or was, then Wikipedia probably has an article about it, and that article will be accurate enough for me to get the general idea. Quickly skimming an article gives me a shallow understanding: enough that I don’t feel completely ignorant when somebody starts talking to me about the topic.

More critical reading not only supplies details, but also gives me a feel for points of controversy. Because Wikipedia articles, especially articles on current events or highly charged topics, are often edited by multiple people, it’s difficult for any single person or group to present a completely one-sided viewpoint. It’s also fairly easy to determine when a topic has been edited to remove any hint of controversy.

The best thing about Wikipedia is that it contains mostly reliable articles on almost every topic of any import. The second best thing is that those articles cite their sources. Every article has a “references” section in which it lists the sources for factual statements made in the article. If a factual statement does not have a reference, there is an annotation saying so. If the article lacks references in general, there’s a big note at the top of the article saying so.

With the references listed, one can easily spend a few minutes reading the primary source material to get more details, or to see if the article presents an accurate summation of the material. A Wikipedia article, like a well-written research paper, is verifiable. Contrast that with a printed encyclopedia, which rarely cites sources.

There are pretty high standards for Wikipedia articles, and the community of editors does a very good job of ensuring that articles meet those standards. If an article does not, there are prominent annotations in the text. If an article appears to be opinion-based, presents only one side of a controversial topic, lacks references, appears to be purely anecdotal, or runs afoul of any number of standards, there is a clearly marked indication of that in the article itself. I’ve seen things like, “this article lacks references” at the top of an article. Or “reference needed” when an unsupported assertion is made. Countless articles say, “This article does not meet standards.”

In short, a Wikipedia article gives you all the tools you need to gauge the reliability of the information presented. Certainly much more than a newspaper, television news channel, printed encyclopedia, magazine article, random Web site, your favorite politician, etc. Of those, only Wikipedia makes it a point to give you the resources you need to verify the information that’s presented.

Other than scientific journals, I can’t think of a general information source that I trust more than Wikipedia. It’s the first stop I make if I want to learn about anything.

Not that I take everything on Wikipedia as gospel. Like any other source of information, there are inadvertent mistakes and deliberate falsehoods in Wikipedia articles. But my experience has been that the number of such incidents is much smaller than in any other source of general information that I’m familiar with. More importantly, such things are typically identified and corrected very quickly.

I trust information from Wikipedia articles, but I also verify it through other means. By contrast, I’ve come to distrust information from most news sources, and use other means to determine the veracity of their articles. The primary differences here are that Wikipedia articles tell me where I can verify the information, and I’m usually able to verify it. News outlets pointedly do not reveal their sources, and very often I find that their versions of events are, to be kind, not entirely accurate.

How to generate unique “random-looking” keys

A common question on Stack Overflow goes something like this:

I would like help designing an algorithm that takes a given number, and returns a random number that is unrelated to the first number. The stipulations being that a) the given output number will always be the same for a similar input number, and b) within a certain range (ex. 1-100), all output numbers are distinct. ie., no two different input numbers under 100 will give the same output number.


I’m encoding 64-bit numbers using an alphabet. But since the binary representations of the numbers are more or less sequential, a simple mapping of bits to chars results in very similar strings. I’m looking for ways to make the output more “random”, meaning more difficult to guess the input when looking at the output string.

There are variants, of course, but they’re all pretty much the same question. The programmer asking the question is looking for a way to obfuscate sequential keys. That is, there is code that generates sequential keys for some kind of record, but the programmer wants to hide the fact that they’re sequential. So if the nth key generated is G75AB, he wants the (n+1)th to be, perhaps, XQJ57 or XYZZY; certainly not G75AC, which makes it obvious that the keys are sequential.

A related Stack Overflow question is this:

What is the quickest way to check that the random string I’ve generated hasn’t been generated by me before? I’m curious as to how places like imgur (which identifies each image by a unique 5 digit pin) carry this out. Clearly, one would have at the least an O(n) solution (or at best O(n log(n)) depending on the algorithm), but since I’m expecting n to be millions, this is going to be an infeasible solution.

This programmer wants to generate “random” keys. He’s decided that the way to generate unique keys is to just generate a random string, see if it’s been used, and if not go back and generate another. See How not to generate unique codes for an explanation of why random selection is an unsupportable algorithm, and How did this happen? for reasons why you shouldn’t be generating random strings.

The best way I know of to generate unique “random looking” keys is by obfuscating sequential numbers. You start with your first key, 1, obfuscate it, and encode the obfuscated number. Then increment the key and do it again.

Encoding is easy: it’s just a base conversion. One common encoding is base-64. In base-64, the integer 1795368205 is encoded as “DSUDaw”. That’s a nice encoding method, but does nothing to hide the sequential nature of keys. The number 1795368206 becomes “DiUDaw”. Anybody who understands base-64 and how numbers are stored in a computer can quickly determine that those two strings represent sequential keys.
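Those two strings are consistent with base-64 encoding the little-endian bytes of the 32-bit value, with the padding dropped. Here’s a sketch in Java that reproduces them; note that the little-endian byte order is inferred from the example, not something base-64 itself dictates:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

public class KeyEncoder {
    // Encode a 32-bit key as base-64 text, using the key's
    // little-endian bytes (an assumption that matches the
    // "DSUDaw" example above) and no trailing padding.
    static String encode(int key) {
        byte[] bytes = ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(key)
                .array();
        return Base64.getEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(encode(1795368205)); // DSUDaw
        System.out.println(encode(1795368206)); // DiUDaw
    }
}
```

The two encodings differ only in the second character, which is why sequential keys are so easy to spot.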

Obfuscation can take many forms. One simple but not particularly effective obfuscation would be to invert the order. That is, imagine you’re generating keys from 1 to 100. You could subtract the sequential number from 101. So the first key value you pass to the encoding method is 100. The next is 99, etc. Somebody looking at your keys would assume that you started at 100 and went down to 1. As I said, not particularly effective.

An effective obfuscation technique would take your sequential number and somehow turn it into another number that is wildly different. So 0 becomes, perhaps, 147,486,932. And 1 becomes 46,231. 2 becomes 186,282. There’s no way a casual observer could understand that the encoded versions of those numbers represent sequential keys. But how do you do that?

The answer is a particularly cool bit of math called a modular multiplicative inverse. Despite the name, it’s not a magical incantation. Eric Lippert explains it well in his blog post, Math from scratch, part thirteen: multiplicative inverses. (In case you’re wondering, the acronym he uses in that second sentence, “WOLOG,” means “Without loss of generality.”)

One of the interesting things about the article is that it shows how to create a function that obfuscates sequential keys reversibly. That is, I can take a number, obfuscate it by applying a simple bit of math, and then de-obfuscate it at a later time by applying a similarly simple bit of math. The code below, which is just a slight modification of what Lippert presents in his blog post, illustrates that.

    private void DoIt()
    {
        const long m = 101;         // Number of keys + 1
        const long x = 387420489;   // must be coprime to m

        // Compute the multiplicative inverse
        var multInv = MultiplicativeInverse(x, m);

        // HashSet is used to hold the obfuscated values so we can ensure that no duplicates occur.
        var nums = new HashSet<long>();

        // Obfuscate each number from 1 to 100.
        // Show that the process can be reversed.
        // Show that no duplicates are generated.
        for (long i = 1; i <= 100; ++i)
        {
            var obfuscated = i * x % m;
            var original = obfuscated * multInv % m;
            Console.WriteLine("{0} => {1} => {2}", i, obfuscated, original);
            if (!nums.Add(obfuscated))
            {
                Console.WriteLine("Duplicate: {0}", obfuscated);
            }
        }
    }

    private long MultiplicativeInverse(long x, long modulus)
    {
        return ExtendedEuclideanDivision(x, modulus).Item1 % modulus;
    }

    private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
    {
        if (a < 0)
        {
            var result = ExtendedEuclideanDivision(-a, b);
            return Tuple.Create(-result.Item1, result.Item2);
        }
        if (b < 0)
        {
            var result = ExtendedEuclideanDivision(a, -b);
            return Tuple.Create(result.Item1, -result.Item2);
        }
        if (b == 0)
        {
            return Tuple.Create(1L, 0L);
        }
        var q = a / b;
        var r = a % b;
        var rslt = ExtendedEuclideanDivision(b, r);
        var s = rslt.Item1;
        var t = rslt.Item2;
        return Tuple.Create(t, s - q * t);
    }

The first few lines of output for that program are:

    1 => 43 => 1
    2 => 86 => 2
    3 => 28 => 3
    4 => 71 => 4
    5 => 13 => 5
    6 => 56 => 6
    7 => 99 => 7
    8 => 41 => 8
    9 => 84 => 9
    10 => 26 => 10

If you run it on your system, you’ll find that every input number from 1 to 100 generates a unique obfuscated number, also from 1 to 100. This technique is guaranteed not to generate a duplicate. See the links above for the mathematical proof.

The multiplicative inverse method of generating obfuscated sequential keys is elegantly simple, effective, and efficient. It works for any number of keys and obfuscates them well enough that somebody would have to expend considerable effort to determine what your value of x is, even if they knew that they had two values that were generated by sequential keys.

By the way, I would suggest that if you use this technique, you don’t use the number 387420489 for your x value. The person trying to determine your obfuscation scheme might run across this article.

The range of choices for x is huge. x can be any number that’s larger than m (the number of keys you want the ability to generate), as long as x and m are relatively prime. The easiest way to ensure the second condition is to use a prime number for x. If you limit yourself to 64-bit prime numbers, you still have approximately 4.158E17 values to choose from. The likelihood of somebody guessing your number, or even figuring it out by brute force, is rather remote.
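Picking a suitable x can be automated. Here’s a sketch in Java using `BigInteger.probablePrime`; the 40-bit size is my choice for illustration, not a recommendation:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class MultiplierPicker {
    // Pick a random prime multiplier x for the obfuscation scheme.
    // A prime larger than m is automatically coprime to m, which is
    // the one hard requirement.
    static long pickMultiplier(long m) {
        SecureRandom rnd = new SecureRandom();
        while (true) {
            // 40-bit probable primes: large enough to obfuscate well,
            // small enough that i * x stays within a signed 64-bit long
            // for modest m. Adjust the bit size to suit your key range.
            long x = BigInteger.probablePrime(40, rnd).longValueExact();
            if (x > m) {
                return x;
            }
        }
    }

    public static void main(String[] args) {
        long x = pickMultiplier(101);
        System.out.println(x + " is coprime to 101: "
                + BigInteger.valueOf(x).gcd(BigInteger.valueOf(101)).equals(BigInteger.ONE));
    }
}
```

One caveat: whatever size you pick, make sure i * x can’t overflow for your largest key, or do the multiplication in a wider type.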

In my next entry, I’ll wrap this all up into a nice easy-to-use class.

Serialize enums as numbers rather than as strings

Imagine that your application works with different document types. You have a server that stores documents in a database, and several applications that query the database, and briefly cache query results in shared memory.

One of the fields that’s returned from a search is the document type, which you’ve encoded in a C# enumeration:

    public enum DocumentType
    {
        WordSmasher = 1,
        NumberCruncher = 2,
        MessageMangler = 3
    }

And in your document record:

    public class DocumentSummary
    {
        public string Name { get; set; }
        public DocumentType DocType { get; set; }
        // other stuff
    }

Now, say you serialize that to JSON, converting the enum values to strings. You end up with something like:

    {"Name":"MyDocument","DocType":"WordSmasher"}

All of the applications include a shared library that defines common data structures and such, like DocumentSummary and DocumentType.

This all works great until you add support for a new type of document: the company’s new PowerPoint competitor, called SleepyAudience. You update your enumeration:

    public enum DocumentType
    {
        WordSmasher = 1,
        NumberCruncher = 2,
        MessageMangler = 3,
        SleepyAudience = 4
    }

Then you compile and test your server code, build one of the other applications to test things with, and release it to the world. Only to find that the other applications, which haven’t yet been rebuilt, die a horrible death when they see the type “SleepyAudience”.

The application crashes because the deserialization code doesn’t know what to do when it finds the string “SleepyAudience” in the DocType. As far as that code is concerned, the DocType field can be an integer value, or it can have one of the first three strings. “SleepyAudience” is not the name of an enum value, as far as the deserializer is concerned.

The easiest solution to this problem is to encode the enumeration value as a number, rather than as a string. Had the DocType been encoded as a number, the deserializer would have read and populated a DocumentSummary instance. It’s then up to the application code to decide what to do with a document type that it is not familiar with. Perhaps the application would filter that document from the list presented to the user. Or it would display “Unknown document type.” Whatever the case, it certainly shouldn’t crash.
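What the application does with an unfamiliar value can be as simple as a tolerant lookup from number to enum. Here’s a sketch in Java (the article’s code is C#; the names mirror its enum, and the fall-back-to-empty behavior is my suggestion rather than part of the original design):

```java
import java.util.Optional;

public class DocTypes {
    enum DocumentType {
        WORD_SMASHER(1), NUMBER_CRUNCHER(2), MESSAGE_MANGLER(3);

        final int value;
        DocumentType(int value) { this.value = value; }

        // Tolerant lookup: an unknown number yields Optional.empty()
        // instead of an exception, so older clients survive new types.
        static Optional<DocumentType> fromValue(int v) {
            for (DocumentType t : values()) {
                if (t.value == v) {
                    return Optional.of(t);
                }
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(DocumentType.fromValue(2)); // Optional[NUMBER_CRUNCHER]
        System.out.println(DocumentType.fromValue(4)); // Optional.empty
    }
}
```

An older client that has never heard of value 4 gets an empty result it can filter or display as “Unknown document type,” instead of a deserialization crash.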

You lose a little bit of readability by encoding the enum value as a number, but only if you’re in the habit of eyeballing JSON data. And, yes, it’s often easier for people who are writing code that interacts with your API to understand what “WordSmasher” means. But we programmers are used to associating numbers with names. It’s no particular burden for a programmer to look up the value “1” in the API documentation to determine that it identifies the WordSmasher document type.

One solution to this problem, of course, is to make sure that all applications that depend on the enumeration are re-built and released whenever the enumeration changes. That would solve the problem, but it’s not practical when you have dozens of different applications online all the time. You can’t just stop the world for a few hours (or even a few minutes) to apply updates.

You could also argue that the applications themselves are at fault, and I would agree with that assessment. After all, an application shouldn’t crash horribly because it got some bad data. The application should ignore that one data item (in this case, a single document record from a list of documents), log an error message, and continue. But standard serialization libraries aren’t that robust. They either understand the entirety of what’s sent, or they roll over and die.

The Robustness principle says “Be conservative in what you send, be liberal in what you accept.” When applied to deserialization it means that if you don’t understand something, deal with it gracefully; do the best you can. Fail gracefully. Interpret the value or ignore it and supply a default. If you can’t do that, then discard the record. If you can’t discard the record, discard the input. At worst return an empty data set. But under no circumstances should the program go down in flames because it didn’t understand some input data.

Regardless of who’s at fault here, the point is that encoding enumeration values as strings has very little, if any, real benefit, and makes it harder to write robust deserialization code. Especially when working with deserialization libraries (as opposed to rolling your own deserialization code), you should do whatever you reasonably can to ensure that the deserializer can interpret your data. Let the rest of your application decide how to proceed, after the data has been read.

In my opinion, the application I describe above has a more fundamental flaw in that the document types it supports are hard coded in an enumeration. But that’s a discussion for another day.

How did this happen?

Last time I showed two different implementations of the naive method for generating unique coupon codes. The traditional method does this:

    do {
        id = pick random number
    } until id not in used numbers
    code = generate code from id

The other is slightly different:

    do {
        id = pick random number
        code = generate code from id
    } until code not in used codes

Before we start, understand that I strongly discourage you from actually implementing such code. The naive selection algorithm performs very poorly as the number of items you choose increases. But the seemingly small difference in these two implementations allows me to illustrate why I think that the second version is broken in a fundamental way.

Think about what we’re doing here. We’re obtaining a number by which we are going to identify something. Then we’re generating a string to express that number. It just so happens that in the code I’ve been discussing these past few entries, the expression is a six-digit, base-31 value.

So why are we saving the display version of the number? If we were going to display the number in decimal, there’d be no question: we’d save the binary value. After all, the user interface already knows how to display a binary value in decimal. Even if we were going to display the number in hexadecimal, octal, or binary, we’d store the binary value. We certainly wouldn’t store the string “3735928559”, “11011110101011011011111011101111”, or “DEADBEEF”. So, again, why store the string “26BB62” instead of the 32-bit value 33,554,432?

I can’t give you a good reason to do that. I can, however, give you reasons not to.

  1. Storing a six-character string in the database takes more space than storing a 32-bit integer.
  2. Everything you do with a six-character code string takes longer than if you were working with an integer.
  3. Display logic should be part of the user interface rather than part of the database schema.
  4. Changing your coding scheme becomes much more difficult.

The only argument I can come up with for storing codes rather than integer identifiers is that with the former, there’s no conversion necessary once the code is generated. While that’s true, it doesn’t hold up under scrutiny. Again, nobody would store a binary string just because the user interface wants to display the number in binary. It’s a simple number base conversion, for crying out loud!

If you’re generating unique integer values for database objects, then let the database treat them as integers. Let everybody treat them as integers. Except the users. “26BB62” is a whole lot easier to read and to type than is “33554432”.
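That base conversion really is just a few lines. Here’s a sketch in Java with a hypothetical 31-character alphabet; the first eleven characters (“123456789AB”) are implied by the article’s pairing of “26BB62” with 33,554,432, but which four letters the real scheme drops (here I, L, O, and U) is a guess:

```java
public class Base31 {
    // Hypothetical 31-character alphabet. The first eleven characters
    // are implied by the "26BB62" example; the excluded letters
    // (I, L, O, U) are my assumption, chosen to avoid look-alikes.
    static final String ALPHABET = "123456789ABCDEFGHJKMNPQRSTVWXYZ";

    static String encode(long n) {
        StringBuilder sb = new StringBuilder();
        do {
            sb.insert(0, ALPHABET.charAt((int) (n % 31)));
            n /= 31;
        } while (n > 0);
        return sb.toString();
    }

    static long decode(String code) {
        long n = 0;
        for (char c : code.toCharArray()) {
            n = n * 31 + ALPHABET.indexOf(c);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(encode(33554432)); // 26BB62
        System.out.println(decode("26BB62")); // 33554432
    }
}
```

Store the integer, index the integer, and call something like `encode` only at the edge where a user actually sees the code.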

I’ll be charitable and say that whoever came up with the idea of storing the display representation was inexperienced. I’m much less forgiving of the person who approved the design. The combination of the naive selection algorithm, storing the generated coupon code rather than the corresponding integer, and the broken base conversion function smacks of an inexperienced programmer working without adult supervision. Or perhaps an inexperienced programmer working with incompetent adult supervision, which amounts to the same thing.

The mere existence of the naive selection algorithm in this code is a sure sign that whoever wrote the code either never studied computer science, or slept through the part of his algorithms class where they studied random selection and the birthday paradox (also, birthday problem). The comment, “There is a tiny chance that a generated code may collide with an existing code,” tells us all we need to know about how much whoever implemented the code understood about the asymptotic behavior of this algorithm. Unless your definition of “tiny” includes a 10% chance after only one percent of the codes have been generated.

The decision to store the generated code string surprises me. You’d think that a DBA would want to conserve space, processor cycles, and data transfer size. Certainly every DBA I’ve ever worked with would prefer to store and index a 32-bit integer rather than a six-character string.

The way in which the base conversion function is broken is hard evidence that whoever modified it was definitely not an experienced programmer. I suspect strongly that the person who wrote the original function knew what he was doing, but whoever came along later and modified it was totally out of his depth. That’s the only way I can explain how the function ended up the way it did.

Finally, that all this was implemented on the database looks to me like a case of seeing the world through database-colored glasses. Sure, one can implement an obfuscated unique key generator in T-SQL. Whether one should is another matter entirely. Whoever did it in this case shouldn’t have tried, because he wasn’t up to the task.

However this design came to exist, it should not have been approved. If there was any oversight at all, it should have been rejected. That it was implemented in the first place is surprising. That it was approved and put into production borders on the criminal. Either there was no oversight, or the person reviewing the implementation was totally incompetent.

With the code buried three layers deep in a database stored procedure, it was pretty safe from prying eyes. So the bug remained hidden for a couple of years. Until a crisis occurred: a duplicate code was generated. Yes, I learned that the original implementation didn’t even take into account the chance of a duplicate. Apparently somebody figured that with 887 million possible codes, the chance of getting a duplicate in the first few million was so remote as to be ignored. Little did they know that the chance of getting a duplicate within the first 35,000 codes is 50%.
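That 50% figure checks out against the standard birthday-problem approximation, p ~ 1 - e^(-k(k-1)/(2N)). A quick verification in Java, with N = 31^6 = 887,503,681 possible codes:

```java
public class BirthdayCheck {
    // Probability of at least one collision after k random draws
    // from n possibilities (standard birthday-problem approximation).
    static double pCollision(double k, double n) {
        return 1.0 - Math.exp(-k * (k - 1) / (2.0 * n));
    }

    public static void main(String[] args) {
        double n = 887_503_681.0;  // 31^6 possible codes
        System.out.printf("p after 35,000 codes: %.3f%n", pCollision(35_000, n));   // ~0.50
        System.out.printf("p after 100,000 codes: %.3f%n", pCollision(100_000, n)); // ~0.996
    }
}
```

By 100,000 codes, a fraction of a percent of the key space, a duplicate is all but guaranteed.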

I also happen to know that at this point there was oversight by a senior programmer who did understand the asymptotic behavior of the naive algorithm, and yet approved “fixing” the problem by implementing the duplicate check. He selected that quick fix rather than replacing the naive algorithm with the more robust algorithm that was proposed: one that does not suffer from the same pathological behavior. The combination of arrogance and stupidity involved in that decision boggles the mind.

The history of this bug is rooted in company culture, where the database is considered sacrosanct: the domain of DBAs, no programmers allowed, thank you very much. And although programmers have access to all of the source code, they aren’t exactly encouraged to look at parts of the system outside of their own areas. Worse, they’re actively discouraged from making suggestions for improvement.

In such an environment, it’s little surprise that the horribly broken unique key generation algorithm survives.

So much for that. Next time we’ll start looking at good ways to generate unique, “random-looking” keys.


