Eventual consistency and client applications

This is another one of those “I can’t believe I have to address this” posts.

Eventual consistency is sometimes used as an optimization in middle tier and back end processing to help balance the load on busy servers and provide a scalable architectures. In client-centered applications like large Web sites, the idea is to respond to user requests very quickly, with the understanding that for some period of time the data on the server will be inconsistent.

There are other uses of eventual consistency, but this post is about using eventual consistency as part of a client-centric application.

On the surface, implementing an eventual consistency model looks pretty easy. All the server has to do in response to a client request to modify data is queue the request and tell the client, “Okay, it’s done.” The client doesn’t really care when it the update actually happens. The server can take its own good time to update the database or whatever else needs to be done.

If only it were so simple.

Imagine a simple shopping cart like those that you see everywhere on the Web. Using a traditional (transaction-based) model, adding an item to the shopping cart sends a request to the server. The server makes the database modifications required to show that product XYZ is user ABC’s shopping cart. The server doesn’t send a response to the user application until all of the updates are done. After the server returns its response, the user can click the “View Cart” button and see everything that’s in his shopping cart.

With a simple eventual consistency model, the server’s action is somewhat different. The server queues a message that says, “Add product XYZ to the shopping cart for user ABC.” At some point in the future, a database server will see the “Add product to cart” message and process it. While that message is making its way through the queue machinery, the server returns a response to the client application that says, in effect, “The item was added to the cart,” even though there’s no guarantee that the item was actually added. Now, imagine this scenario:

  1. User clicks “Add to cart.”
  2. Server queues message “Add product XYZ to the cart for user ABC.”
  3. Server returns success message to client.
  4. User clicks “View cart.”
  5. Server receives request, “Return contents of user’s cart.”
  6. Server returns cart contents.
  7. Database server receives and processes “Add product” message.

Because the database server didn’t receive and process the “Add product” message before the user requested the cart contents, the user is going to see his cart without that product in it. The system showed the user an inconsistent view of the data, which breaks the cardinal rule for client applications: never astonish the user.

Any attempt at explaining this behavior to users is doomed to fail. “Oh, you clicked on your cart too fast. Just wait a minute and then refresh the page,” is not a proper response. That’s just going to confuse the user even more. Computers are supposed to make things easier for users. Imagine ordering pizza over the phone:

You: “I’d like a medium pepperoni with onions and green peppers.”
Pizza guy: “Anything else?”
You: “And a two liter soda.”
Pizza guy: “Anything else?”
You: “No. That’s all.”
Pizza guy: “So that’s a medium pepperoni with onions and green peppers. That’ll be $11.94”
You: “And my two liter soda.”
Pizza guy: “Oh, right. And your two liter soda. Your total is $14.98.”

If you want your customers to think your application is the digital equivalent of the stoned-out pizza guy, go ahead and implement a naive eventual consistency model. If you want people to take you seriously and actually use your site, take some time to read and understand what Dr. Werner Vogel, CTO and Vice President of Amazon.com has to say about eventual consistency in his article Eventually Consistent – Revisited.

Pay particular attention to the section titled Consistency–Client and Server, where he talks about variations and conflict resolution. Especially notice that at no time does he mention the possibility of a process seeing old data. That is, if a process updates a data item and subsequently reads that data item back, it will never receive an old value. Using our shopping cart example, the client knows that it added an item to the cart. The server returned an acknowledgement. If the client subsequently reads the cart, that item better be there! If the previously added item is not in the cart, your program is in error, and no amount of explaining to the user is going to change that.

The potential for the user getting an inconsistent view of the data has to be immediately obvious to any programmer who’s competent enough to write the application. That being the case, I have to conclude that the programmer somehow thinks it’s okay to confuse the user. What really astonishes me is that the people in charge–the product designers and managers–accept it when programmers say, “That’s just the way things are when you write a scalable system.” One need only point to Amazon.com for a counter example.

An eventual consistency model that presents an inconsistent view to the client is just plain broken. I have yet to see a reasonable defense for confusing the user.

It’s relatively easy to provide session consistency (see Dr. Vogel’s blog post) so that the client’s view remains consistent, and there are simple and effective conflict resolution strategies you can use on the back end to ensure that your eventually consistent data model doesn’t remain perpetually inconsistent.

Eventual consistency is just one way to extend the scalability of a large distributed Web site. One can go very far (much further than the buzzword artists will have you believe) using a traditional transaction-based system. It’s always a good idea to start with a transactional system because it’s easier to implement and prove correct, and it will serve the needs of all but the largest of sites. Eventual consistency is an optimization that’s designed to extend the capatilities of a larger, working, system. It’s quite likely that, if it comes time to extend your system with an eventual consistency model, you can add it as a layer between the client-facing front end and the server-based back end. Large parts of your existing system won’t change. That won’t be the final word on scalability, but it will let you add capacity that will handle the traffic while you implement the final solution.

Starting with an eventual consistency model is premature optimization, with all of the customary pitfalls. It’s likely that a premature implementation will optimize the wrong thing and not scale the way you intended, because what you think will be the hot spots (the areas that need to be optimized) when you start rarely turn out to be the hot spots by the time you get around to having performance problems. Writing an eventual consistency model when you’re struggling to get a handful of users for your product is wasted effort. Worse, it’s wasted effort up front that costs even more down the road when you realize that you have to throw it out and start over.

My advice for those who are considering an eventual consistency model is the same as what I give to those who think their program needs to be as fast as possible. Make your program work. Then make it scale. It takes a bit of effort to make a working system scale, true. But if you don’t have a working system to start with, you won’t have enough customers to make scaling worthwhile.

A billion dollars that nobody wants

If you’re looking for examples of Congressional idiocy, it’s hard to beat the story of $1 Billion That Nobody Wants. In short, there are about 1.5 billion one-dollar coins piled in bags in Federal Reserve vaults. Why? Because nobody wants them. Why is the U.S. Mint still making them? Because Congress said so.

Congress has been trying to shove dollar coins down our throats since the introduction of the Susan B. Anthony dollar in 1979. That turned out to be one of the most unpopular coins in U.S. history, and production stopped after 1981. An increase in dollar coin usage (primarily from vending machines) resulted in another 50 million or so coins being minted in 1999.

In 2000, Congress mandated the Sacagawea dollar. About 1.3 billion of them were minted in that year. Not surprisingly, the coin was highly unpopular with most people, and the number of coins minted per year dropped off sharply.

Still undeterred, Congress passed the Presidential $1 Coin Program in late 2005. This program, modeled after the State Quarter program, began in 2007 and will continue until 2016. It directs the U.S. Mint to product dollar coins with engravings of the presidents on one side. This despite warnings from the Congressional Budget Office that there would be low demand, and the Government Accountability Office warning that unused coin stockpiles and storage costs would increase.

Shockingly, demand for the coins is almost non-existent. Collectors want them. Nobody else cares.

It gets even sillier. Proponents of the Sacagawea dollar were reluncant to sign on to the Presidential coin program until some genius added a provision saying that the Sacagawea dollar must account for at one of every three dollar coins minted in any year.

A couple of quotes from the NPR article struck me as especially funny.

Members of Congress reasoned that a coin series that changed frequently and had educational appeal would make dollar coins more popular. The idea came from the successful program that put each of the 50 states on the backs of quarters.

This is a perfect example of Congressional reasoning. They failed to grasp the most important point. The State Quarter program didn’t magically make people like quarters. People already used quarters. A lot. On the other hand, 30 years of experience show us that people in general just don’t like the dollar coin. One has to think that, after a few major redesigns and a few minor redesigns, the design isn’t the problem. The American public doesn’t want a dollar coin! Stop wasting time and money trying to force one on us.

Here’s the other quote that I found especially amusing. Or frightening, I suppose.

Leslie Paige, who represents watchdog group Citizens Against Government Waste, says the government should withdraw the dollar bill from the market and force Americans to use the coins.

“I think Americans will definitely embrace the dollar coin if they’re just given the opportunity,” she says.

There is a difference, Ms. Paige, between giving me an opportunity and forcing me to use the coin. Please consult your dictionary. And by the way, the most optimistic projections of cost savings by switching from the dollar bill to the dollar coin are about $5 billion over 30 years. That works out to $166 million per year, or less than 5% of what it costs to run Congress for a single year. Just cut the staff of every Senator and Representative by one person, and we’d make up the difference.

But has Congress passed a bill to stop the insanity? Of course not. That would make too much sense. Instead, the Obama Administration has announced that minting of the coins for circulation will be suspended. They’ll still make some for collectors, but that’s about it.

I’ll grant that the amount of money we’re talking about is small. But the reasoning behind the dollar coin idiocy is exactly the same as the reasoning behind everything Congress does. They invent problems and then invent solutions that wouldn’t solve the problems, even if the problems really existed. And yet we continue to choose to pay these people and give them power over us.

We really need to wake up.

There is one source of Truth

In religion, politics, and other endeavors, Truth is an elusive goal. Depending on your beliefs, Truth might be found in the Bible, the Torah, Koran, the Democratic Party platform, or the lessons you learned while traipsing through the woods. Truth, in most endeavors, is highly subjective.

Truth is subjective in programming, too. If you have any doubt, just ask a dozen different programmers to tell you what is the best programming language, the best indentation style, whether domain driven design is a good idea, or whether inversion of control is just a fancy way to say, “do things in the most complicated way possible.” There are plenty of “truths” in programming.

But in a computer program, there can be only one source of Truth. That is, there can be many representations of the data that your program relies on, but only one representation can be considered the Authority. If you create different views of the data or cache some data in order to speed access, you are making a copy that at some point will differ from the Authority. It is no longer Truth.

Once you do this, you have to make a decision. Your choices are:

  1. Periodically invalidate the cache so that it will be updated from the Authority from time to time. This ensures that your cache will reflect the Authority with a maximum latency of some given period of time. The cache represents Truth as it existed the last time the cache was refreshed.This technique works well if your program can function well with data that is slightly out of date. We use this technique in the crawler to cache robots.txt files. If we always required the most up to date robots.txt, the crawler would have to issue two Web requests for every page it downloaded (one for robots.txt, and then one for the page). Instead, our crawler caches a site’s robots.txt file for a maximum of 24 hours. Truth, in this case, is “as it existed the last time I downloaded the robots.txt,” which will never be more than 24 hours out of date.
  2. Update the cache whenever the Authority changes. This sounds like a good idea, but there are drawbacks.First, the Authority has to be built with caching in mind, and must supply an API that clients can plug in to. The clients have to accept the Authority’s caching API, which might be overly restrictive.

    This can also put an unacceptable performance burden on Authority updates, especially if more than one client is updating its cache. If the Authority has to call each client’s update method, then update speed is limited by the speed of all the subscribed clients. If, instead, the Authority posts updates to a message queue, then there won’t be a perceptible delay in Authority updates, but there will be a non-zero and potentially large latency in the cache updates.

    There are many ways of reacting to an update message posted by the Authority. The simplest is to invalidate any cache of the affected data. That can be quite effective, but you have to be careful that your caching layer knows exactly what data it’s holding on to. That turns out to be a rather difficult task, at times.

    This update strategy is usually used when you want to maintain an up-to-date view of the Authority data, but with a different organization. It works best when updates are infrequent. If you’re doing frequent updates to the view, you probably want to re-think the Authority and have it maintain a view that’s more amenable to however you’re querying it.

  3. Understand that your alternate view is a snapshot of Truth as it existed at some point in time, and it is never updated. This works well if you’re reporting on a snapshot, but it’s not a good general caching solution.

There are hybrid solutions that combine options 1 and 2, but in general that’s pretty rare. It seems like the height of folly to implement option 3 if you’re working with live data, but it’s distressingly easy to fall into that trap inadvertently. For example, you might build a denormalized view of some data in your database because querying the normalized view is prohibitively expensive. You initially use that denormalized view for reporting purposes, but then you foolishly decide that you can use it for other things, too. Pretty soon, large parts of your system are depending on the denormalized view, and changes to the Authority aren’t reflected, or aren’t reflected quickly enough. At that point, your system is broken because your user interface isn’t reflecting Truth.

My experience with relational databases has been that if you denormalize the data, you cannot rely on it reflecting any further changes. You can try to write your code so that it maintains the denormalized view whenever updates are made to the normalized data, but those efforts will almost certainly fail. This is especially true over time, when the original developer moves off the project and somebody new who doesn’t understand all of the denormalized structures is assigned to the project. The result is … well, it’s not pretty. I’ve never seen a case in which trying to maintain two separate views of a database worked well over the long term. Don’t try it!

Where relational databases are concerned, your best bet is to design your database so that you can update and query it efficiently. If it’s still too slow after you’re sure that your design is as good as it can be, then you throw hardware at the problem: more memory, a faster processor, faster drives, or a distributed databse.

Note that I’m not necessarily advocating a fully normalized database design. There are very good and compelling reasons to design your database to be partially denormalized. What I’m arguing against is maintaining a denormalized view in addition to a fully normalized view. I know that it’s possible with triggers and other such database machinery. It can even be done well if you fully understand the ramifications of what you’re doing and if you are meticulous in adding and maintaining your triggers. I’ve found, though, that most development teams are incapable of that level of attention to detail.

Segal’s Law states, “A man with a watch knows what time it is. A man with two watches is never sure.” The same holds true when you have more than one source of Truth in your system. You have to understand that, unless you’re querying the Authority, the data you get back will be, at best, slightly out of date. At worst, it will be so wildly out of date that it’s just plain wrong.

Why you shouldn’t use the .NET sort

My friend and business partner David Stafford recently posted a blog entry, .Net’s Sort Is Not Secure. Don’t Use It. Here’s a Better One, in which he shows that the .NET sort implementation (used by Array.Sort and List.Sort, and possibly others) can easily be made to exhibit pathological behavior.

How bad is it? You can construct an array of one million items that will take the .NET sort implementation more than 80 minutes to sort. The average case is something like half a second.

David’s contention is that this is a security vulnerability. Others might disagree with him, but that’s their choice. David is right. It’s not a great stretch to imagine a Web site that, as part of its work, accepts a data file from a user and sorts that data. A malicious user, knowing that the server is using .NET, could construct a data file that causes the sort to exhibit this pathological behavior, causing the site to become unresponsive. This is nothing short of a denial of service attack, made possible by the poor sorting implementation. As David shows in his post, it’s not terribly difficult to construct a worst-case array.

That fits very comfortably within the definition of a security vulnerability.

David makes two other assertions: that the sort is inflexible, and that the sort is slower than it should be, even in the absence of a malicious adversary.

The sort is somewhat flexible in that it lets you supply a comparison delegate. It does not, however, let you supply a swap delegate. That’s okay in many cases. However, if you’re sorting large structures (value types), or if you want to do an indirect sort (often referred to as a tag sort), a swap delegate is a very useful thing to have. The LINQ to Objects sorting algorithm, for example uses a tag sort internally. You can verify that by examining the source, which is available in the .NET Reference Source. Letting you pass a swap delegate would make the thing much more flexible.

David’s tests show that the .NET sort implementation is slower than it could be. In my opinion, it’s slower than it should be. David’s implementation is faster than the .NET sort in the general case, and doesn’t exhibit pathological behavior in the worst case. The worst case is so terrible, in fact, and so easy to provoke, that the .NET sort should be rewritten.

And yet, the .NET team has refused to address this issue. At best, that’s irresponsible. One can only hope that enough users log in and vote to have the issue addressed, forcing the .NET team to reconsider their decision.

David also noted that LINQ sorting is also vulnerable. What he didn’t point out is that LINQ to Objects uses a completely different algorithm than does Array.Sort. The LINQ to Objects sort is a standard naive Quicksort implementation. As you can see from his timings, the LINQ sort is 50% slower than the already tortoise-like Array.Sort in the face of an adversary.

Understand, the .NET sort will be faster than David’s Introsort in the general case if you’re sorting a primitive type (int, double, etc.) and you’re not supplying a comparison delegate. The .NET sort is faster in that case because it has special-case code to sort primitive types. If David took the time to make special-case additions to his sorting algorithm, it would outperform the .NET sort in those cases, as well.

Of course, even the special-case sorts in the .NET runtime are vulnerable in the face of an array constructed to provoke the worst case.

So take David’s advice: don’t rely on the .NET sort. Download his code and use it.

I’m considering putting together something similar to replace the LINQ to Objects sort. The general idea is to create a class called SafeOrderedEnumerable that implements IOrderedEnumerable, and uses David’s Introsort in the CreateOrderedEnumerable method. To invoke it, I’ll create extension methods SafeOrderBy and SafeOrderByDescending so that you can write, for example:

var sorted = myList.SafeOrderBy(x => x);

That should put LINQ to Objects sorting in the same ballpark as sorting an array. Not the same, of course, but close, and it will avoid the potential pathological cases.

In Praise of Technical Debt

The term “technical debt“, as commonly used, refers to the eventual consequences of poor software design or development practices. The Wikipedia article and most other references consider technical debt to be a Very Bad Thing. The literature is filled with examples of development projects whose combined technical debt eventually killed or seriously hampered the company.

There is a huge amount of literature about techniques that claim to reduce or eliminate technical debt, and there are countless design patterns and development practices to go along with all that talk. Those patterns and practices are usually ways of paying up front in order to avoid having to pay more later. And to the extent that they actually meet that standard, using them to avoid technical debt is a Good Thing.

Unfortunately, those techniques intended to avoid technical debt often cause more problems than they solve, and end up costing more than it would have cost to incur the debt.

It seems every software development pundit preaches a “no technical debt” sermon with the fervor that some misguided financial advisors preach a debt-free lifestyle. And, unsurprisingly, they all have their books, articles, software packages, and training programs that will teach you how to avoid technical debt.

Yes, there’s a lot of hype and snake oil in the software development methodologies business.

Just as there are sound financial reasons to incure financial debt, there are sound business and technical reasons to incur technical debt. In fact, I think technical debt is more often a Good Thing. It takes a very strong argument to convince me not to incur many kinds of technical debt.

The comparison of technical debt to financial debt isn’t perfect. When you take out a loan at a bank or other lending institution, you agree unconditionally to repay the money you borrowed, with interest. I know of only two ways you can avoid paying: bankruptcy and death. In bankruptcy, you might have to pay back some of the debt, and if you die you might leave somebody else with the obligation. And, of course, there are the ongoing interest payments. Financial debt is never free.

Technical debt, on the other hand, often doesn’t need to be repaid, and often has no interest payments. It’s a “free loan.” Not always, but in my experience more often than not. And when it does have to be repaid, the cost is usually quite reasonable.

A good example of technical debt causing a problem is in the development of a Web site. Imagine that you have an idea for a new site. You slap something together in a few days or weeks, post it on your site, and it’s immediately a hit. Within weeks you’re getting more traffic than your poor server can handle. It’s pretty easy to expand your site to use more Web servers, but then your back end database server melts down under the load. That, too, is easily expanded, but eventually you reach a point where some critical component of your system is the bottleneck and there’s no easy way to scale it. Adding a faster server with more memory and a faster hard disk just postpones the problem.

You incurred the technical debt when you slapped together a simple Web site without taking into account the possibility of massive scaling, and now it’s time to repay that debt. And it’s painful. You also have to do it pretty quickly or all your customers will move over to your competitor who spent six months developing a copycat site. You might find yourself, as you crawl in bed at 9 A.M. after another sleepless night trying to retrofit your program, lamenting the decision to ignore the scaling problem in favor of getting something working. And you vow never to do that again.

That’s the wrong attitude to have.

To hear the “no technical debt” preachers tell it, the world is full of failures who would have succeeded had they not taken on the technical debt. And they’ll show you the successes who refused to incur technical debt, opting instead to spend the extra time required to “do it right” up front. What they won’t tell you about, because they don’t know of or choose not to mention, are the many successes who just “slap things together” and deal with the consequences successfully, and the many projects that fail despite “doing it right.” Most sites fail, not because they didn’t develop their software correctly, but because their product idea just didn’t fly. It doesn’t matter how well your code base is constructed if your business idea just doesn’t work.

The “no technical debt” preachers are wrong, plain and simple. They’ll have you believe that any technical debt will crater your project. Even those who say that you should repay technical debt as soon as possible are wrong. As with financial debt, the secret to managing technical debt is to examine each case and make an informed decision. You have to balance the likelihood of having to repay the debt against the cost of repaying it. It’s a simple (in concept, sometimes not in implementation) risk / reward calculation. What is the risk of incuring the debt, and what is the potential reward?

In the case of our hypothetical Web startup, the risk is that your server melts down before you can modify the code to be more scalable. But the likelihood of that risk is pretty darned small. First, you have to build something that people actually care about. The truth is that most Web startups turn on their servers and hear crickets. A few people will come to check it out, yawn, and move on to the next cool new thing. If that happens, you’ll be really happy that you didn’t waste a bunch of time writing your code to support massive scaling.

Even if your site starts getting traffic, it’s not like you’ll get a million dedicated users the first month. You’ll see traffic growing, and you’ll have time to refactor or rewrite your code to meet the increased demand. It might be painful–rewrites often are–but it’s unlikely that traffic will increase so quickly that you can’t keep up with it.

Some developers attempt to design for scalability to start, and in doing so end up making everything scalable. They spend a lot of time building a scalability framework so that every component can be scaled easily. Every component is split into smaller pieces, and each of those pieces is built to be scaled. That sounds like a good idea, but there are huge drawbacks to doing things that way.

The first problem is that not all components need to support massive scaling. In most software systems, there are one or two, or at most a small handful of, components that are bottlenecks. Designing those to be scalable is a Good Thing. Time spent making anything else scalable is a waste of resources. Even the time spent on those things you think will be bottlenecks is often a waste, because it’s incredibly difficult to tell where the bottlenecks will be before you have customers banging on your site. In all too many cases, designing for scalability is like attempting to optimize a program as you’re writing it–before you do a performance analysis to determine where the bottlenecks are.

Designing for scalability makes the code more complicated. Techniques like dependency injection and inversion of control are very effective ways to create more flexible systems, but they make the code more complicated by inserting levels of indirection that often are difficult to follow. Taken to their extremes, these and similar techniques create a bloated code base that is less maintainable and harder to change than the “old style” code they replace. This is especially true when a development team loses sight of the objective (make something that works) in favor of the process. When you see a system that has a nearly one-to-one mapping between interface and implementation, you know that the people who designed and wrote it lost sight of the forest because they were too busy examining the trees.

Those who preach these techniques for reducing or eliminating technical debt assume that you know what you want to build and how you want to build it, and that you have the time and resources to become fully buzzword compliant. In the case of a startup business, none of those three things is true. Startups typically have a few guys, an idea, and lots of enthusiasm. They’re going to try things, quickly building a prototype, making it available for others to look at, and then discarding it to try something else when that idea fails. Eventually, if they’re lucky, they’ll hit on something that seems to resonate, and they’ll start concentrating on that idea. That’s the nature of a startup. Those guys are living on ramen noodles and little sleep, hoping that one of their ideas strikes a nerve before the savings runs out. They don’t have time to waste worrying about things like technical debt.

Fred Brooks, in his 1975 book The Mythical Man Month, famously said:

The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. […] Hence plan to throw one away; you will, anyhow.

That’s as true today as it was back then. Time spent making your first prototype scalable is wasted. You’re going to throw it away. If you’re lucky, you’ll reuse some of your underlying technology. But most of what you write in your first attempt will be gone by the time you finish the project.

As an aside, there are those who say that The Mythical Man Month is outdated and that many of its lessons are irrelevant today because we have faster, more powerful, and less expensive computers, better tools, and smarter programmers. I’ve seen studies showing that programmer productivity is five to ten times what it was 35 years ago. Whereas programmers can do more in less time today, we’re also trying to build systems that are two or more orders of magnitude larger and more complex than the systems being built back then. Brooks’ cautions are more pertinent today than they were in 1975 because teams are larger and the problems we’re trying to solve are more difficult.

Avoiding technical debt is like paying for flood insurance and building a dike around your house when you live on the top of a mountain in the desert. Sure, it’s possible that your house will flood, but it’s highly unlikely. And if the water does get that high, you have much more pressing problems that make the insurance policy and the dike irrelevant. You’ve wasted time, money, and other resources to handle an event that almost certainly won’t ever occur, and if it does occur, your solution won’t matter one bit.

So go ahead and incur that technical debt, but do so intelligently, with full knowledge of what it will cost you if it comes due. But know also that in many, perhaps most, cases, you’ll never have to pay it.

Half of forever is still forever

You’re probably wondering if this is really necessary. Believe me, I’m a bit surprised by it myself. But every day I see evidence that supposedly competent programmers don’t understand this fundamental point.

What am I talking about?

Let’s say you have two lists. One is a list of accounts and the other is a list of transactions that you need to apply to those accounts. Each type of record has an AccountNumber property. In essence, you have this:

class Account
    public int AccountNumber { get; private set; }
    // other properties and methods

class Transaction
    public int AccountNumber { get; private set; }
    // other properties and methods

List<Account> Accounts;
List<Transaction> Transactions;

The items in the Accounts and Transactions lists aren’t in any particular order. Your task is to create output that groups the accounts and transactions, so it will look like this:

Account #1
Account #2

The naive way to do this is, for each account, search all the transactions to see if its AccountNumber field matches the account number. Something like:

foreach (var account in Accounts)
    Console.WriteLine("Account #{0}", account.AccountNumber);
    foreach (var trans in Transactions)
        if (trans.AccountNumber == account.AccountNumber)
            // output transaction

If the number of accounts and transactions is even moderately large, this is going to take a very long time. If we say that m is equal to the number of accounts and n is the number of transactions, then this will take time proportional to m * n. Imagine you have 10,000 accounts and 5,000 transactions. Your code will look at every transaction 10,000 times, meaning you end up doing 50 million comparisons.

The faster way to do this is to sort the accounts and the transactions, and then do a standard merge, which is among the oldest concepts in computing. The merge itself takes time proportional to m + n, but sorting is a little more expensive. Sorting will take m log m time for the accounts and n log n for the transactions. So the total time is (m log m) + (n log n) + m + n. Let’s do the numbers:

Sort accounts m log m 10,000 * 15 150,000
Sort transactions n log n 5,000 * 14 70,000
Merge n + m 10,000 + 5,000 15,000
Total 235,000

Now I’ll be the first to admit that these numbers aren’t perfectly comparable. That is, sorting the transactions list is more expensive than iterating over it, so the merge won’t be 200 times faster than the naive method. But it’ll probably be 100 times as fast. At least. And that’s with pretty small lists. If you’re working with 100,000 accounts and a million transactions, you’re talking maybe 22 million operations for the merge and 100 billion operations for the naive method. The merge will complete in a few minutes (if that), and the naive method will take essentially forever.

In practice you could probably merge those two large lists faster than the naive method would do the smaller lists.

All of the above is elementary computer science. Really, they teach this stuff in 100 level computer science courses. And yet every day I see people asking “how do I make this faster?” That’s bad enough. What’s worse–and what makes me fear for the future–is how many people answer with, “use multithreading!” (Or “Use the Task Parallel Library!”) It’s maddening.

If you have a quad-core machine and you get perfect parallelism, then your program will execute in one-fourth of the time it takes to execute on a single thread. But one fourth of 50 million is still 12.5 million. Even if you applied parallel processing to our simple case above, the naive method will still be two orders of magnitude slower than the single-threaded merge.

No amount of parallelism will save a bad algorithm, just as no amount of assembly language optimization will make a bubble sort execute faster than quick sort in the general case.

Remember, a better algorithm gives you orders of magnitude (or even more) increases in performance. Parallel processing gives you, at best, linear increases. Spend your time on improving your algorithm. Then worry about how you might bring multiple threads to bear in order to make it faster.



A sample text widget

Etiam pulvinar consectetur dolor sed malesuada. Ut convallis euismod dolor nec pretium. Nunc ut tristique massa.

Nam sodales mi vitae dolor ullamcorper et vulputate enim accumsan. Morbi orci magna, tincidunt vitae molestie nec, molestie at mi. Nulla nulla lorem, suscipit in posuere in, interdum non magna.