Ending Spam

A while back I had to implement a simple statistical text classifier for a project I was working on. David dropped the book Ending Spam on my desk, saying that it had the information I needed. Understand, I wasn’t building a spam filter, so I was somewhat dubious about the book’s usefulness.

It turns out that the book was quite useful. Everything I needed to know about implementing a simple Bayesian text classifier is contained in the 20 pages of Chapter 4. But the book is much, much better than just that.

The book begins with a couple of chapters on the history of email spam and early attempts to block or filter it. Then comes a chapter about language classification concepts, followed by the above-mentioned Chapter 4 that describes the fundamentals of statistical filtering. With some brief study and a little trial-and-error, you can build an incredibly effective statistical text classifier from just that.

The book goes on to describe the nuts and bolts of tokenizing messages, show how spammers attempt to obfuscate messages (“The Low-Down Dirty Tricks of Spammers”) and how to defeat them, and cover storage concerns and advanced statistical filtering techniques. All in all, the book is chock full of good information.
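
To give a flavor of what that looks like in practice, here’s a minimal sketch of the two core pieces: a crude tokenizer and a Graham-style table of per-term spam probabilities. The tokenizer rules, the function names, and the training-corpus handling are my own illustration, not code from the book; the 0.01/0.99 clamping is the same trick I use in my own program (more on that below).

import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase runs of word-ish characters, 3 or longer.
    # A real filter does far more work on headers, HTML, and obfuscation.
    return re.findall(r"[a-z0-9'$!-]{3,}", text.lower())

def term_probabilities(spam_msgs, ham_msgs):
    # Per-term P(spam | term), clamped to [0.01, 0.99] so a rare or
    # extreme token can't completely dominate the combined score.
    spam_counts = Counter(t for m in spam_msgs for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_msgs for t in set(tokenize(m)))
    probs = {}
    for term in set(spam_counts) | set(ham_counts):
        s = spam_counts[term] / max(len(spam_msgs), 1)
        h = ham_counts[term] / max(len(ham_msgs), 1)
        probs[term] = min(0.99, max(0.01, s / (s + h)))
    return probs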

The book is short on theory and long on practical advice, with good functional descriptions of the math that don’t get bogged down in statistical theory or formal mathematical notation. Math purists will probably cringe at the approach the author uses, but I found it very readable. Granted, I don’t fully understand some of the math, but it’s presented in a way that made it very easy for me to implement in my program.

Throughout the book, spammers are the enemy and those working to combat spam are the white hats. That device is used to good effect, but sometimes it’s taken too far. A minor complaint, to be sure, as it doesn’t really detract from the content. But it does become annoying after a while.

The author makes one assertion that might have been true when the book was written but is, if not false, then a little less true now. He asserts that spammers have tried and failed to defeat statistical filters. Many spam messages can now defeat simple Bayesian filters through a technique known as “Bayesian poisoning,” in which the message contains “word salad”: a bunch of nonsense sentences filled with normally innocent words. The idea is to overwhelm the filter with a large number of non-spam words so that spam words like “V1agra” will slip through. Newer filters are designed to combat such attacks, but the fact remains that statistical filters must continue to evolve because they can be defeated.

Minor complaints aside, I still highly recommend this book. Whether you’re interested in learning more about how spam works and how filters fight it, or if you’re interested in text classification techniques in general, you’ll find plenty in this book to make it worth the cost.

Weird computer problem

This one is right up there with the strangest problems I’ve ever seen.

We bought the parts for and built four new machines, each with a Gigabyte GA-P35-DS4 motherboard, an Intel Core 2 Quad (Q6600) processor, and 8 gigabytes of RAM. Three of them are working fine. One of them (mine, unfortunately) exhibits a rather odd flaw: Windows Task Manager reports that there are only two CPUs. On the others, it reports four CPUs. Other than that, the computer seems to run just fine.

I tried all the obvious things: re-seating the CPU, backing off the overclock (from the 2.8 GHz we had it at to the stock 2.4 GHz), and finally swapping processors with one of the other machines. No dice. The other machine reports four CPUs and mine still reports just two.

The problem is almost certainly with the motherboard, but I can’t prove that yet. Is it possible that I have four cores all running, and Windows is reporting the wrong thing? We ran CPU-Z, which also reports only two cores, but I don’t know if it’s getting information from the hardware or from Windows.

The really odd thing here is that I’ve never heard of a partial failure like this. That is, if the motherboard was faulty, wouldn’t the dang thing just not work at all?

In any case, I’m looking for a program that I can boot and have it tell me about the CPU: what type, how fast, and how many cores are actually running. Does such a thing exist? I downloaded the Ultimate Boot CD, but the tools on that disk don’t support the Core 2 Quad. Nor do they appear to identify how many cores are actually functioning.

If you know of a utility I can download and burn to a bootable CD so that I can test the CPU outside of Windows, I’d really like to hear about it. Leave a comment here on the blog, or drop me mail: jim at mischel.com.

Windows Vista secrets

I got another new computer a couple of weeks ago, but I’ve been too busy until recently to actually set it up. The new machine has an Intel Core 2 Quad processor that we’ve overclocked to about 2.8 GHz, 8 gigabytes of RAM, and two one-terabyte disk drives. I’ll be stressing that machine pretty soon with the work I’m doing.

I originally put Windows XP 64 bit edition on the machine, but after giving it some thought I decided to go with Vista 64 Ultimate. I had considered Vista last March when I got the dual core machine, but time pressure made installing XP the better choice. Besides, I don’t like being an early adopter on my production machines. But Vista has been out for a year now (more than that, when you consider the open beta), and once you get past all the bitching about how it’s different and breaks stuff, it turns out that it’s better than XP. Certainly the move from XP to Vista doesn’t seem as shocking to me as the move from Windows 2000 to XP was.

The most annoying part isn’t even the new OS, but rather installing my old applications. You don’t realize how much you’ve come to depend on some things until they’re not around anymore. Tracking down and installing all my old software is going to take some time.

In any case, last night I went looking for Vista PowerToys, hoping that somebody had ported the Windows XP PowerToys (which never got ported to 64 bit) to Vista 64. The one I was most interested in is the “Open Command Window Here” tool that adds an item to the context menu of file system folders, giving you a quick way to open a command prompt in that directory.

It turns out that such a thing is built into Vista, albeit hidden for some strange reason. If you hold down the Shift key and right-click on a folder in Windows Explorer, you’ll see the option in the context menu: Open Command Window Here. This is implemented very well in Vista, and will even silently map a network drive if the folder you click on is located on another machine. Curiously, this only works if you click on the folder in the right-hand pane of Explorer (the list pane?). If you hold down Shift and right-click on a folder in the folders pane, you get a completely different menu.

The hidden context menu has another very useful feature: Copy as Path. If you click on that option, the full path name of the directory is copied to the clipboard as a quoted string. That’s quite handy. I can’t count the times I’ve had to copy/paste from the address bar to get the path, and then copy/paste from the list pane to get the file name. Having it all in one place is going to save me a lot of time.

It strikes me as odd, though, that these options are hidden unless you press the Shift key when you click on the item. Why weren’t they just added to the standard context menu?
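
As an aside, the Shift-only behavior appears to be controlled by the registry: as far as I can tell, the cmd verb under HKEY_CLASSES_ROOT\Directory\shell carries an “Extended” value, which is what tells Explorer to show the item only on a Shift+right-click. If you’d rather have it on the normal menu, something like the following sketch should work. The key path and value name are my assumption about where Vista keeps this, so back up the key first, and run it from an elevated prompt.

import winreg

# Assumption: removing the "Extended" value from the verb's key makes
# "Open Command Window Here" appear on a plain right-click.
key_path = r"Directory\shell\cmd"
with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, key_path, 0,
                    winreg.KEY_SET_VALUE) as key:
    try:
        winreg.DeleteValue(key, "Extended")
        print("Removed the Extended flag; check the context menu.")
    except OSError:
        print("No Extended value found; nothing to change.")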

High school math to the rescue

If you’ve experimented with Bayesian spam filtering or similar statistical text analysis, you’re probably familiar with the technique of Bayesian combination: you compute a spam probability for each individual term and then combine those probabilities to come up with a probability of the message being spam. For example, if the individual terms’ probabilities are denoted by p1, p2, p3 … pn, then the probability of the message being spam is the product of the individual term probabilities divided by the sum of that product and the product of the terms’ complements (1 - p). That is:

P = p1 x p2 x p3 ... x pn
I = (1 - p1) x (1 - p2) x (1 - p3) ... x (1 - pn)
R = P/(P + I)

This works well when the number of terms is relatively small. However, since probabilities are numbers between 0 and 1, multiplying the probabilities for a large number of terms will quickly underflow even a double precision floating point number. Both products end up as zero, and the final division becomes a meaningless 0/0. How, then, to solve the problem?
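
To make the failure concrete, here’s a quick sketch (illustration only, not the code from my program) of the naive calculation falling over:

def naive_spam_probability(probs):
    # Direct implementation of R = P / (P + I). Fine for a handful of
    # terms, useless once the products underflow to zero.
    p = 1.0
    i = 1.0
    for x in probs:
        p *= x
        i *= (1.0 - x)
    return p / (p + i)

print(naive_spam_probability([0.99, 0.99, 0.01]))   # ~0.99, as expected
# With a realistic number of terms, both products underflow to 0.0 and the
# division is 0.0/0.0, which raises ZeroDivisionError in Python:
# naive_spam_probability([0.7, 0.3] * 1000)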

It took me a few minutes to remember logarithms from high school math. Adding the logarithms of two numbers and then exponentiating the sum is the same as multiplying the two numbers. That is:

Exp(Log(a) + Log(b)) == a * b

The beauty of using logarithms in this way is that my intermediate results no longer underflow. My individual term probabilities are clamped to be between 0.01 and 0.99, so the range of logarithms (base e) that I’m working with is roughly -4.61 to -0.01. I can add a whole lot of those before I overflow my machine’s floating point numbers. Even if I’m using 10,000 terms, my intermediate result is only about -46,000, well within the range of even a single precision float.

Using logarithms solves one problem but raises another. Remember the final calculation:

R = P/(P + I)

There is no way (that I know of) to do the equivalent of addition using logarithms. Adding the logs of the intermediate results is the same as multiplying the numbers. That is, if PL is the sum of the logs of the probabilities, and IL is the sum of the logs of their complements, then performing this calculation:

R = Exp(PL - (PL + IL))

is the same as writing:

R = P/(P * I)

I scratched my head over this one for an embarrassingly long bit before I realized that I could calculate the reciprocal easily enough. That is, I can compute (P + I)/P using the logs because (P + I)/P is the same as P/P + I/P, which works out to 1 + I/P. And I/P is just Exp(IL - PL).

The final code, then, becomes:

PL = Log(P1) + Log(P2) + Log(P3) ... + Log(Pn)
IL = Log(1 - P1) + Log(1 - P2) + Log(1 - P3) ... + Log(1 - Pn)
R' = 1 + Exp(IL - PL)
R = 1 / R'
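
Or, as a working function (a quick Python sketch of the same calculation; the clamping is the 0.01–0.99 business mentioned above):

from math import exp, log

def spam_probability(probs):
    # Combine per-term probabilities in log space:
    # R = P / (P + I) computed as 1 / (1 + Exp(IL - PL)),
    # so the tiny products P and I are never formed directly.
    pl = 0.0   # running sum of Log(p)
    il = 0.0   # running sum of Log(1 - p)
    for p in probs:
        p = min(0.99, max(0.01, p))   # clamp so Log never sees 0 or 1
        pl += log(p)
        il += log(1.0 - p)
    d = il - pl
    if d > 700.0:   # exp() would overflow; the message is essentially all ham
        return 0.0
    return 1.0 / (1.0 + exp(d))

print(spam_probability([0.99, 0.99, 0.01]))   # ~0.99, matches the naive version
print(spam_probability([0.7, 0.3] * 1000))    # ~0.5, and no underflow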

A few years back, I bought a high school “advanced mathematics” book because it had all the basic stuff in one place–stuff I once memorized and even used sometimes but have forgotten over the years. Every time I get stumped on a math problem, that’s the first place I go. It’s surprising how often a high school math textbook has the answer. And when it doesn’t, it usually has a hint that I can use to quickly find the answer online.

Later: Writing this post revealed a bug in my implementation. Somewhere along the line I had coded Exp(PL - IL) instead of Exp(IL - PL), which explains why my preliminary tests yesterday had me scratching my head trying to figure out why my prioritization code wasn’t working right.

Sake update

At the end of July, I wrote about my attempt to make sake, but then failed to write an update about how the experiment progressed. Everything worked fine for a few days: I kept the mixture at the right temperature, and stirred it twice per day as recommended. It was growing quite a nice batch of white koji mold. But when I checked it on the third or fourth day, I noticed a tinge of green mold. When I checked it again 12 hours later, the green mold was quite evident.

The sake instructions say in several places that if you see mold that’s any color but white or gray, you should throw it all away and start over. Continuing the process with the green mold will likely end up making a drink that doesn’t taste very good, and could make me sick to my stomach. Fortunately, no pathogen would be able to live in the sake, so serious illness from a bad batch isn’t a concern.

In any case, my first attempt at making sake was a failure, and I haven’t gone to the effort of trying it again. When I do, I’ll be sure to update here.

Watch that transaction log

This afternoon, the program that adds records to my database started timing out on every transaction. Although I could connect to the database and execute queries, all updates would time out. When I logged in to the SQL Server machine, I noticed that it was responding very slowly. It took a few minutes of poking around before I finally hit on the idea of checking the event log. There I found this message, which was repeated every two minutes:

Autogrow of file ‘Media_log’ in database ‘Media’ was cancelled by user or timed out after 120032 milliseconds. Use ALTER DATABASE to set a smaller FILEGROWTH value for this file or to explicitly set a new file size.

The log file was set to grow automatically by 10% whenever it gets full. What I didn’t realize is that growing the log file involves more than just allocating another bunch of sectors from the disk. Whatever SQL Server does to grow the log file takes longer than two minutes when the log file is 100 gigabytes in size. To make matters worse, my update program had over 40,000 updates pending and no way for me to store them on disk for later. An oversight on my part, I’ll admit. When you’re blowing and going trying to get something up and working, you tend to let silly things like disaster recovery fall by the wayside.

I tried a number of different things before I took a stab in the dark and decided to create a new log file, hoping that SQL Server would give up on the old overflowed file and write to the new one. Fortunately that worked and my data is now safely ensconced in SQL Server and backed up.
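
For next time, the fix the error message suggests is to stop growing by percentage. As I understand it, the transaction log has to be zero-initialized when it grows, so a 10% autogrow of a 100-gigabyte log means writing something like 10 gigabytes of zeros before any pending update can proceed; a modest fixed increment keeps each growth quick. A sketch of the change (only the Media database and Media_log file names come from the event log message; the server name, the 512 MB increment, and the use of pyodbc are placeholders):

import pyodbc

# autocommit is required because ALTER DATABASE can't run inside a
# multi-statement transaction.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=master;"
    "Trusted_Connection=yes",
    autocommit=True)
conn.cursor().execute(
    "ALTER DATABASE Media "
    "MODIFY FILE (NAME = N'Media_log', FILEGROWTH = 512MB);")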

I obviously have a bit to learn about administering a SQL Server installation. For now, I’ve backed everything up and am in the process of shrinking the database before bringing my application up again. Tomorrow I’ll be reading up on the different recovery models and implementing a regular backup schedule so I don’t have to go through this again.
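
If it turns out that I don’t need point-in-time recovery for this data, the simple recovery model will truncate the log automatically at each checkpoint; otherwise, regular log backups under the full model let SQL Server reuse the log space instead of growing the file forever. A sketch of both options (again through pyodbc; the backup path is a placeholder):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=master;"
    "Trusted_Connection=yes",
    autocommit=True)
cur = conn.cursor()

# Option 1: stay in full recovery and back the log up on a schedule.
cur.execute("BACKUP LOG Media TO DISK = N'D:\\Backups\\Media_log.trn';")

# Option 2: switch to simple recovery if point-in-time restore isn't needed.
# cur.execute("ALTER DATABASE Media SET RECOVERY SIMPLE;")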