I know I promised that I’d have the second part of my multithreading series, but I haven’t finished writing it up. In the meantime, I ran across this problem today.
Albert Einstein is quoted as saying, “Everything should be made as simple as possible, but not simpler.” That’s probably a paraphrase of what he actually said, but the sentiment is the same, and it applies as much to computer programming and many other fields as it did to his own field of theoretical physics. There is beauty and power in simplicity. Today’s lesson is a case in point.
A programmer who interviewed with a mobile development shop was asked how he’d approach the problem of incrementally searching a phone’s list of contacts. For example, the contacts list might contain:
Earl
Elaine
Jerry
Joe
Mary
Melody
Penelope
Paul
The desired functionality is that when the user types the first character, the phone should display all of the names that contain that character. So if the user typed “e”, the resulting list would be:
Earl
Elaine
Jerry
Joe
Melody
Penelope
And when the user types the next character, the list would be reduced to those names that contained the new substring. So if the user’s next input was “l”, we’d want to display the contacts that have “el” in their names:
Elaine
Melody
Penelope
The candidate was asked to provide an “efficient” solution to the problem, but apparently the definition of efficiency wasn’t supplied.
I don’t know what the interviewer was expecting, but the interviewee said that he got bogged down trying to develop a solution that made use of tries. I suspect he was trying to come up with a generalized suffix tree on the fly.
When the candidate asked me about this after his interview, I proposed a much more pragmatic solution. I figured that, given the number of contacts typically found in a phone (a thousand would be a lot), a sequential search that uses the results from the previous search would be plenty fast enough, and dead simple to implement. It would look something like this:
results = contacts    // initialize results list with all contacts
search_string = ""
while (user types letter)
{
    // append input letter to search string
    search_string = search_string + letter
    // get new results by filtering the previous result set
    results = filter(results, search_string)
    display(results)
}

filter(old_results, search_string)
{
    new_results = new list
    for each contact in old_results
        if contact name contains search_string
            add contact to new_results
    return new_results
}
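For what it’s worth, here’s a minimal runnable sketch of the same approach in Python. The contact list and the simulated keystrokes are just stand-ins for the phone’s real data and input loop, and I’ve made the comparison case-insensitive:

def filter_contacts(old_results, search_string):
    """Keep only the contacts whose names contain the search string."""
    return [name for name in old_results
            if search_string.lower() in name.lower()]

# Stand-in data: the contact list and keystrokes from the example above.
contacts = ["Earl", "Elaine", "Jerry", "Joe",
            "Mary", "Melody", "Penelope", "Paul"]

results = contacts      # start with every contact
search_string = ""

for letter in "el":     # simulate the user typing "e", then "l"
    search_string += letter
    results = filter_contacts(results, search_string)
    print(search_string, "->", results)

Running that prints the same two result lists shown above: six names after “e”, and three after “el”.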
The first search has to go through all the contacts, but even searching 1,000 contact names on a mobile phone would be pretty darned quick. Furthermore, the number of contacts to search is likely in the dozens or fewer¹ after the first two or three characters are typed. There’s just no point in optimizing a search of such small lists. In addition, this solution takes very little memory and requires no pre-processing: it needs only the sequential list of contacts, whereas the suffix tree approach takes extra memory proportional to the number and lengths of the contact names. Finally, you’d have to update or rebuild that suffix tree every time the user adds, changes, or deletes a contact.
One seemingly simple optimization you could make is to store the position of the found substring along with each contact in the result set. Then, when the next character is typed, you would only have to check whether the next character in the contact name matches the character just typed. But adding that optimization would complicate things a bit. You’d have to define a structure to hold the contact and the substring position, the first search of the entire contacts list would have to be handled separately, and you’d have to explicitly check string lengths before doing the next-character check. You’d also have to track every position where the substring occurs, not just the first one; otherwise “Penelope” would be dropped when the user typed “el”, because the first “e” in “Penelope” isn’t followed by an “l”. All to save a few microseconds. It wouldn’t be worthwhile.
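To make the comparison concrete, here’s a sketch of what that position-tracking variant might look like in Python, tracking every match position as it would have to. The Match record and the helper names are just my illustration, not anything from the interview:

from dataclasses import dataclass

@dataclass
class Match:
    name: str
    positions: list          # every index where the current search string occurs

def initial_matches(contacts, first_char):
    # The first keystroke has to be handled separately: scan every contact
    # and record every position where that character occurs.
    results = []
    for name in contacts:
        positions = [i for i, c in enumerate(name.lower())
                     if c == first_char.lower()]
        if positions:
            results.append(Match(name, positions))
    return results

def refine_matches(matches, prior_length, next_char):
    # Later keystrokes: prior_length is the length of the search string
    # before the new character. Keep a position only if the name is long
    # enough and the character right after the current match is the one
    # just typed.
    refined = []
    for m in matches:
        positions = [p for p in m.positions
                     if p + prior_length < len(m.name)
                     and m.name[p + prior_length].lower() == next_char.lower()]
        if positions:
            refined.append(Match(m.name, positions))
    return refined

Even without wiring it up to an input loop, the extra bookkeeping (a record type, a special-cased first search, the length checks) is plain to see, which is exactly the point.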
I’m all for fancy data structures and efficient algorithms, but they have their place. It’s a certainty that there are faster ways to search a list of names for substrings, but those solutions are more complex, require more memory, take longer to write and test, and are more difficult to modify. All those things are reasonable if you’re searching a very large list, but with the typically small mobile phone contacts list, the few microseconds saved aren’t worth the expense of those faster solutions. When the simple algorithm is fast enough, use it and move on. There are more interesting problems to solve elsewhere.
Next time: part two of simple multithreading. Really.
1 – An analysis of Project Gutenberg text shows that the most common bigram in the English language, “th”, occurs about 3.8% of the time. I realize that proper names won’t exhibit exactly the same bigram frequencies as normal English text, but I suspect they’d be close. Based on that, after only two characters typed my algorithm will be working with, at most, about 4% of the total names list. So instead of 1,000 names, I’d be searching 40 names. With four characters typed, I’d be working with fewer than 10 names.