Half of forever is still forever

You’re probably wondering if this is really necessary. Believe me, I’m a bit surprised by it myself. But every day I see evidence that supposedly competent programmers don’t understand this fundamental point.

What am I talking about?

Let’s say you have two lists. One is a list of accounts and the other is a list of transactions that you need to apply to those accounts. Each type of record has an AccountNumber property. In essence, you have this:

class Account
{
    public int AccountNumber { get; private set; }
    // other properties and methods
}

class Transaction
{
    public int AccountNumber { get; private set; }
    // other properties and methods
}

List<Account> Accounts;
List<Transaction> Transactions;

The items in the Accounts and Transactions lists aren’t in any particular order. Your task is to create output that groups the accounts and transactions, so it will look like this:

Account #1
    Transaction
    Transaction
Account #2
    Transaction
    Transaction
    Transaction
...etc

The naive way to do this is, for each account, to scan all of the transactions to see whether the AccountNumber property matches the account’s number. Something like:

foreach (var account in Accounts)
{
    Console.WriteLine("Account #{0}", account.AccountNumber);
    foreach (var trans in Transactions)
    {
        if (trans.AccountNumber == account.AccountNumber)
        {
            // output transaction
        }
    }
}

If the number of accounts and transactions is even moderately large, this is going to take a very long time. If we say that m is equal to the number of accounts and n is the number of transactions, then this will take time proportional to m * n. Imagine you have 10,000 accounts and 5,000 transactions. Your code will look at every transaction 10,000 times, meaning you end up doing 50 million comparisons.

The faster way to do this is to sort the accounts and the transactions, and then do a standard merge, which is among the oldest concepts in computing. The merge itself takes time proportional to m + n, but sorting is a little more expensive. Sorting will take m log m time for the accounts and n log n for the transactions. So the total time is (m log m) + (n log n) + m + n. Let’s do the numbers:

Step                 Complexity    Calculation        Operations
Sort accounts        m log m       10,000 * 15           150,000
Sort transactions    n log n       5,000 * 14             70,000
Merge                m + n         10,000 + 5,000         15,000
Total                                                    235,000

Now I’ll be the first to admit that these numbers aren’t perfectly comparable. That is, a step in a sort costs more than a simple comparison in a scan, so the merge won’t be a full 200 times faster than the naive method. But it’ll probably be 100 times as fast. At least. And that’s with pretty small lists. If you’re working with 100,000 accounts and a million transactions, you’re talking maybe 22 million operations for the sort-and-merge and 100 billion comparisons for the naive method. The merge will complete in a few minutes (if that), and the naive method will take essentially forever.
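In case the merge itself is unfamiliar, here’s roughly what it looks like in code. This is just a sketch that reuses the Accounts and Transactions lists declared above; it assumes both lists fit in memory and sorts them in place by AccountNumber.

Accounts.Sort((x, y) => x.AccountNumber.CompareTo(y.AccountNumber));
Transactions.Sort((x, y) => x.AccountNumber.CompareTo(y.AccountNumber));

int t = 0;  // index into the sorted Transactions list
foreach (var account in Accounts)
{
    Console.WriteLine("Account #{0}", account.AccountNumber);
    // Skip any transactions whose account number isn't in the Accounts list
    while (t < Transactions.Count && Transactions[t].AccountNumber < account.AccountNumber)
    {
        t++;
    }
    // Output this account's transactions
    while (t < Transactions.Count && Transactions[t].AccountNumber == account.AccountNumber)
    {
        // output Transactions[t]
        t++;
    }
}

After sorting, each list is walked exactly once, which is where the m + n term comes from.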

In practice you could probably merge those two large lists faster than the naive method could handle the smaller ones.
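It’s worth noting that you don’t even have to sort to escape the m * n trap. This isn’t the merge described above, just another way to get m + n behavior: LINQ’s ToLookup builds a hash table of transactions keyed by account number in a single pass. (This sketch assumes a using System.Linq; directive.)

// Build the lookup in one pass over the transactions (proportional to n),
// then answer each account's query without scanning (proportional to m).
var transactionsByAccount = Transactions.ToLookup(t => t.AccountNumber);

foreach (var account in Accounts)
{
    Console.WriteLine("Account #{0}", account.AccountNumber);
    foreach (var trans in transactionsByAccount[account.AccountNumber])
    {
        // output transaction
    }
}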

All of the above is elementary computer science. Really, they teach this stuff in 100-level computer science courses. And yet every day I see people asking, “How do I make this faster?” That’s bad enough. What’s worse, and what makes me fear for the future, is how many people answer with, “Use multithreading!” (Or, “Use the Task Parallel Library!”) It’s maddening.

If you have a quad-core machine and you get perfect parallelism, then your program will execute in one-fourth of the time it takes on a single thread. But one-fourth of 50 million is still 12.5 million. Even if you applied parallel processing to our simple case above, the naive method would still be roughly 50 times slower than the single-threaded merge.
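To make that concrete, here’s what the “just use multithreading” answer amounts to: a hypothetical parallel version of the naive loop using the Task Parallel Library (assume a using System.Threading.Tasks; directive). It spreads the comparisons across cores, but the total work is unchanged at m * n.

Parallel.ForEach(Accounts, account =>
{
    // Each of the m accounts still scans all n transactions
    Console.WriteLine("Account #{0}", account.AccountNumber);
    foreach (var trans in Transactions)
    {
        if (trans.AccountNumber == account.AccountNumber)
        {
            // output transaction (parallel output would also need
            // coordination to keep each account's block contiguous)
        }
    }
});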

No amount of parallelism will save a bad algorithm, just as no amount of assembly language optimization will make a bubble sort execute faster than quick sort in the general case.

Remember, a better algorithm gives you orders-of-magnitude increases in performance. Parallel processing gives you, at best, a linear speedup. Spend your time improving your algorithm. Then worry about how you might bring multiple threads to bear to make it faster.