Text encoding is a programmer’s nightmare. Life was much simpler when I didn’t have to know anything other than the ASCII character set. Even having to deal with extended ASCII (characters with values between 128 and 255) wasn’t too bad. But when we started having to work with international character sets, multibyte character sets, and the Unicode variants, things got messy quickly. And they’re going to remain messy for a long time.
When I wrote a Web crawler to gather information about media files on the Web, I spent a lot of time making it read the metadata (ID3 information) from MP3 music files. Text fields in ID3 Version 2 are marked with an encoding byte that says which character encoding is used. The recognized encodings are ISO-8859-1 (an 8-bit character set for Western European languages, including English), two 16-bit Unicode encodings, and UTF-8. Unfortunately, many tag editors would write the data in the computer’s default 8-bit character set (Cyrillic, for example) and mark the fields as ISO-8859-1.
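The encoding byte is the first byte of the frame payload; the values below follow the ID3v2.3/v2.4 specs. A minimal sketch of reading a text frame per that byte (the sample bytes are made up for illustration):

```python
# Decode an ID3v2 text frame payload according to its leading encoding byte.
ID3_ENCODINGS = {
    0x00: "latin-1",    # ISO-8859-1
    0x01: "utf-16",     # 16-bit Unicode with byte-order mark
    0x02: "utf-16-be",  # 16-bit Unicode, big-endian, no BOM (v2.4)
    0x03: "utf-8",      # UTF-8 (v2.4)
}

def decode_text_frame(payload: bytes) -> str:
    encoding = ID3_ENCODINGS[payload[0]]
    return payload[1:].decode(encoding).rstrip("\x00")

print(decode_text_frame(b"\x03Caf\xc3\xa9"))  # UTF-8 frame -> Café
```

The trouble described above is exactly this lookup lying to you: the byte says `0x00`, but the bytes that follow were written in some other 8-bit code page.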
That’s not a problem if the resulting MP3 file is always read on that one computer. But then people started sharing files and uploading them to the Web, and the world fell apart. Were I to download a file that somebody in Russia had added tags to, I would find the tags unreadable because I’d be interpreting text written in a Cyrillic code page as though it were ISO-8859-1. The garbage that results is commonly referred to as mojibake.
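The mangling is easy to reproduce, assuming the tagger’s default code page was Windows-1251 (the usual Russian one). Note that the damage from this particular misreading is reversible in principle, because ISO-8859-1 maps every byte to some character, so the decode itself destroys nothing (later processing can still lose data, as we found):

```python
original = "Привет"                    # "Hello" in Russian
raw = original.encode("windows-1251")  # the bytes the tag editor wrote
garbled = raw.decode("latin-1")        # what a reader trusting the tag sees
print(garbled)                         # Ïðèâåò

# Undoing it: re-encode as Latin-1 to recover the bytes, decode correctly.
recovered = garbled.encode("latin-1").decode("windows-1251")
print(recovered)                       # Привет
```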
The Cyrillic-to-English problem isn’t so bad, by the way, since every byte still maps to some character, but when the file was originally written in a multibyte encoding such as Shift JIS, it’s disastrous.
My Web crawler would grab the metadata, interpret it as ISO-8859-1, and save it to our database in UTF-8. We noticed early on that some of the data was garbled, but we didn’t know quite what the problem was or how widespread it was. Because we were a startup with too many things to do and not enough people to do them, we just let it go at the time, figuring we’d get back to it.
When we did get back to it, we discovered that we had two problems. First, we had to figure out how to stop collecting mangled data. Second, we had to figure out a way to un-mangle the millions of records we’d already collected.
To correct the first problem, we built trigram models for a large number of languages and text encodings. Whenever the crawler ran across a field that was marked as containing ISO-8859-1, it would run the raw undecoded bytes through the language models to determine the likely encoding. The crawler then used that encoding to interpret the text. That turned out to be incredibly effective, and almost eliminated the problem of adding new mangled records.
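A toy version of that detection step might look like the following. The corpora, smoothing, and candidate encodings here are illustrative stand-ins, not what we actually used; the idea is just to score each candidate decoding under a character-trigram model and keep the best fit:

```python
from collections import Counter
from math import log

def train_trigrams(corpus: str) -> Counter:
    """Count character trigrams in a (language-specific) corpus."""
    return Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))

def score(text: str, model: Counter) -> float:
    """Log-likelihood of the text's trigrams, with add-one smoothing."""
    total = sum(model.values())
    return sum(log((model[text[i:i + 3]] + 1) / (total + 1))
               for i in range(len(text) - 2))

def guess_encoding(raw: bytes, models: dict) -> str:
    """Try each encoding (paired with the model of the language it
    usually carries) and return the one whose decoding scores best."""
    best, best_score = None, float("-inf")
    for encoding, model in models.items():
        try:
            candidate = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # bytes that can't decode rule the encoding out
        s = score(candidate, model)
        if s > best_score:
            best, best_score = encoding, s
    return best
```

For example, with a Russian model behind `"windows-1251"` and an English model behind `"latin-1"`, the raw bytes of a Russian string score far better under the first pairing, because the Latin-1 misreading produces trigrams the English model has never seen.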
Fixing the existing mangled records turned out to be a more difficult proposition. The conversion from mangled ISO-8859-1 to UTF-8 resulted in lost data in some circumstances, and we couldn’t do anything about that. In other cases the conversion resulted in weird accented characters intermixed with what looked like normal text. It was hard to tell for sure sometimes because none of us knew Korean or Chinese or Greek or whatever other language the original text was written in. Un-mangling the text turned out to be a difficult problem that we never fully solved in the general case. We played with two potential solutions.
The first step was the same for both solutions: we ran text through the trigram model to determine if it was likely mangled, and what the original language probably was.
For the first solution attempt, we’d then step through the text character-by-character, using the language model to tell us which trigrams were likely at that position, and compare them against the trigram that actually appeared. If we ran across a trigram that was very unlikely or totally unknown (for example, the trigram “zqp” is not likely to occur in English), we’d replace the offending trigram with one of the highly likely ones. This required a bit of backtracking, and it often generated some rather strange results. The solution worked, after a fashion, but not well enough.
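The flagging half of that first attempt can be sketched as follows; the corpus is a toy illustration, and the substitution-with-backtracking step is omitted since that was the part that produced the strange results:

```python
from collections import Counter

def train_trigrams(corpus: str) -> Counter:
    return Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))

def suspect_trigrams(text: str, model: Counter) -> list:
    """Positions and trigrams the model has never seen."""
    return [(i, text[i:i + 3])
            for i in range(len(text) - 2)
            if model[text[i:i + 3]] == 0]

model = train_trigrams("the cat sat on the mat in the hat")
print(suspect_trigrams("the zqp", model))
# "the" passes; the trigrams around "zqp" are flagged as unseen
```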
For the second attempt we selected several mangled records for every language we could identify and then re-downloaded the metadata. By comparing the mangled text with the newly downloaded and presumably unmangled text, we created a table of substitutions. So, for example, we might have determined that “zqp” in mangled English text should be replaced by the word “foo.” (Not that we had very much mangled English text.) We would then go through the database, identify all the mangled records, and do the substitutions as required.
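One plausible shape for that table-building step, assuming the simplest possible alignment (the naive one-to-one character pairing, which only works when the mangling preserved the text’s length, as 8-bit-to-8-bit misreadings do):

```python
def build_substitutions(pairs):
    """Harvest a character substitution table from (mangled, clean)
    record pairs obtained by re-downloading the metadata."""
    table = {}
    for mangled, clean in pairs:
        if len(mangled) != len(clean):
            continue  # skip pairs the naive alignment can't handle
        for m, c in zip(mangled, clean):
            table[m] = c
    return table

def unmangle(text, table):
    """Apply the substitutions, leaving unknown characters alone."""
    return "".join(table.get(ch, ch) for ch in text)
```

A table harvested from one Russian record pair then repairs any other record mangled the same way, which is the sense in which a handful of re-downloads per language could cover millions of rows.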
That approach was much more promising, but it didn’t catch everything and we couldn’t recover data that was lost in the original translation. Ultimately, we decided that it wasn’t a good enough solution to go through the effort of applying it to the millions of mangled records.
Our final solution was extremely low-tech. We generated a list of the likely mangled records and created a custom downloader that went out and grabbed those records again. It took a lot longer, but the result was about as good as we were likely to get.