As programmers, we sometimes lose sight of the problem we’re supposed to solve because we’re focused on the problem we want to solve. Or perhaps on the problem we think we need to solve. Today’s lesson is a case in point.
Imagine you have a very large collection (many thousands) of records that contain individuals’ names and their skills. The data consists of a name followed by a comma-separated list of skills. Something like:
Jim Mischel, programming, wood carving, brewing, curmudgeon
The problem is that some of the skills are misspelled or abbreviated: “mgmt” for “Management,” for example, or “hadup” instead of “Hadoop,” etc.
Your job is to go through the records and fix the misspellings.
Seems easy enough, right? Just go through the skills, look up each word in a dictionary, and correct the entry if the word isn’t in the dictionary. But there are two problems:
- A lot of common terms used in describing skills are domain specific and won’t be in any dictionary.
- Even if there were such a dictionary, how would you determine that “mgmt” is supposed to be “Management,” or that “hadup” should be “Hadoop?”
It’s at this point that we lose sight of the real problem we’re trying to solve. Being programmers, we start thinking about figuring out edit distances, machine learning, support vector machines, etc. In no time at all we’re shaving a yak.
Writing a computer program that will detect and correct misspellings like this is hard. I’m talking Ph.D. research project hard. I guarantee that the person who asked you to solve the problem isn’t willing to let you embark on a multi-year research project.
Fortunately, you don’t have to. You can solve it in a day or two with two simple programs and a little bit of manual effort.
First, write a program that builds a dictionary of all the terms used, along with how often they’re used. You’ll end up with a very large list that looks something like:
Management,200
Mgmt,3
Hadoop,10
Hadup,1
Mamagement,1
...
If you make the assumption that if a term appears very often, it’s probably spelled correctly, then all you’re really concerned about are those terms that occur infrequently. You can decide on a threshold of, say, three or five, and have your program output only those terms. You’ll probably end up with a list of a few hundred infrequently used terms.
That’s where the manual work comes in. You have to look at each word, determine if it’s a misspelling, and if so find out what the correct spelling is. You manually create a file that contains the misspelled term along with the correct term. For example:
mgmt, Management
mamagement, Management
hadup, Hadoop
wood craving, wood carving
etc.
The second program is even easier. All it does is load the replacements file into a dictionary with the key being the misspelled term, and then reads each name record. For each name record, the program checks each term against the dictionary. If it finds a misspelling that’s in the dictionary, it writes the replacement. Otherwise, it writes the original text. The algorithm looks like this:
dict = LoadDictionary();
create output file
for each record in file
output name
for each skill in record.skills
if skill in dict
output replacement text
else
output original skill text
This approach won’t be perfect, but it will be very good. You’ll have to tinker with the threshold value a bit to get the right balance between detecting real misspellings and “false positives”: terms that are used infrequently but aren’t actual misspellings.
You’ll also have to accept that common misspellings will not be detected. Fortunately, common misspellings will be … common. So you can add them to the replacements dictionary as they’re detected, and subsequent passes over the file will catch them. You could do some work with edit distance algorithms to catch the more common misspellings, but it would be a huge amount of work for relatively little gain.
I’ll admit that it’s somewhat dissatisfying having to manually create the replacements file, but it’s hard to argue with the result. Especially if I can get somebody else to do that work. Then all I have to do is create the list of terms and their counts, and write the program that will make the replacements once the manual work is done.
It’s not sexy, but it solves the problem quickly and effectively.