A while back I had to implement a simple statistical text classifier for a project I was working on. David dropped the book Ending Spam on my desk, saying that it had the information I needed. Understand, I’m not building a spam filter, so I was somewhat dubious as to the book’s usefulness.
It turns out that the book was quite useful. Everything I needed to know about implementing a simple Bayesian text classifier is contained in the 20 pages of Chapter 4. But the book is much, much better than just that.
The book begins with a couple of chapters on the history of email spam and early attempts to block or filter it. Then comes a chapter about language classification concepts, followed by the above-mentioned Chapter 4 that describes the fundamentals of statistical filtering. With some brief study and a little trial-and-error, you can build an incredibly effective statistical text classifier from just that.
The book goes on to describe the nuts and bolts of tokenizing messages, the ways spammers attempt to obfuscate messages (The Low-Down Dirty Tricks of Spammers) and how to defeat them, storage concerns, and advanced statistical filtering techniques. All in all, the book is chock full of good information.
The book is short on theory and long on practical advice, with good functional descriptions of the math without getting bogged down in statistical theory or formal mathematical notation. Math purists probably cringe at the approach the author uses, but I found it very readable. Granted, I don’t fully understand some of the math, but it’s presented in a way that made it very easy for me to implement in my program.
Throughout the book, spammers are the enemy and those working to combat them are the white hats. That device is used to good effect, but sometimes it’s taken too far. A minor complaint, to be sure, as it doesn’t really detract from the content. But it does become annoying after a while.
The author makes one assertion that might have been true when the book was written but is, if not false, then a little less true now. He asserts that spammers have tried and failed to defeat statistical filters. Many spam messages can now defeat simple Bayesian filters through a technique known as “Bayesian poisoning,” in which the message contains “word salad”: a bunch of nonsense sentences filled with normally innocent words. The idea is to overwhelm the filter with so many non-spam words that spam words like “V1agra” slip through. Newer filters are designed to combat such attacks, but the fact remains that statistical filters must continue to evolve because they can be defeated.
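To make the word salad idea concrete, here is a minimal sketch of a naive Bayes scorer. It is not the method the book describes, just a bare-bones illustration that assumes equal priors and add-one smoothing. The function names (`train`, `spam_probability`) and the tiny training sets are made up for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count how many messages of one class contain each token."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-token evidence in log space, assuming equal priors."""
    log_odds = 0.0
    for token in set(text.lower().split()):
        # Add-one smoothing keeps unseen tokens from producing log(0).
        p_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_ham = (ham_counts[token] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam) - math.log(p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical training data, far too small for real use.
SPAM = ["buy v1agra now", "cheap v1agra online", "buy cheap pills now"]
HAM = [
    "meeting notes attached",
    "lunch on friday",
    "project schedule and meeting notes",
    "see the attached schedule",
]

spam_counts = train(SPAM)
ham_counts = train(HAM)

plain = "buy v1agra now"
# The same message padded with innocent words: the "word salad".
poisoned = plain + " meeting notes attached schedule lunch friday project"

p_plain = spam_probability(plain, spam_counts, ham_counts, len(SPAM), len(HAM))
p_poisoned = spam_probability(poisoned, spam_counts, ham_counts, len(SPAM), len(HAM))

print(f"plain:    {p_plain:.3f}")
print(f"poisoned: {p_poisoned:.3f}")
```

With this toy data the plain message scores well above 0.9, while the padded version drops below 0.5, because every innocent word pulls the combined score toward “not spam.” Real filters blunt this in various ways, for example by scoring only the most decisive tokens, but the sketch shows why a naive combination of every word is easy to dilute.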
Minor complaints aside, I still highly recommend this book. Whether you’re interested in learning more about how spam works and how filters fight it, or in text classification techniques in general, you’ll find plenty in this book to make it worth the cost.