High school math to the rescue

If you’ve experimented with Bayesian spam filtering or similar statistical text analysis, you’re probably familiar with the technique of Bayesian combination that involves computing the probabilities for individual terms and then combining those probabilities to come up with a probability of a message being spam. For example, if the individual terms’ probabilities are denoted by p1, p2, p3 … pn, then the probability of a message being spam is computed by dividing the product of the individual term probabilities by that product plus the product of the inverse of the probabilities. That is:

P = p1 x p2 x p3 ... x pn
I = (1 - p1) x (1 - p2) x (1 - p3) ... x (1 - pn)
R = P/(P + I)

This works well when the number of terms is relatively small. However, since probabilities are numbers between 0 and 1, multiplying the probabilities for a large number of terms will quickly underflow even a double precision floating point number, resulting in either 0, negative infinity, or some other non-number that renders your calculation useless. How, then, to solve the problem?

It took me a few minutes to remember logarithms from high school math. Adding logarithms of two numbers and then taking the inverse is the same as multiplying two numbers. That is:

Exp(Log(a) + Log(b)) == a * b

The beauty of using logarithms in this way is that my intermediate results no longer underflow. My individual term probabilities are clamped to be between 0.01 and 0.99, so the range of logarithms (base e) that I’m working with is -4.06 to -0.01. I can add a whole lot of those before I overflow my machine’s floating point number. That is, if I’m using 10,000 terms, my intermediate result will only be -40,605–well within the range of even a single precision float.

Using logarithms solves one problem but raises another. Remember the final calculation:

R = P/(P + I)

There is no way (that I know of) to do the equivalent of addition using logarithms. If I add the logs of the intermediate results, it’s the same as multiplying the numbers. That is, if PL is the sum of the logs of the probabilities, and IL is the sum of inverses, then performing this calculation:

R = Exp(PL/(PL + IL))

Is the same as writing:

R = P/(P * I)

I scratched my head over this one for an embarrassingly long bit before I realized that I could calculate the reciprocal easily enough. That is, I can compute (P + I)/P using the logs because (P + I)/P is the same as P/P + I/P, which works out to 1 + I/P. And I/P is just Exp(IL - PL).

The final code, then, becomes:

PL = Log(P1) + Log(P2) + Log(P3) ... + Log(Pn)
IL = Log(1 - P1) + Log(1 - P2) + Log(1 - P3) ... + Log(1 - Pn)
R' = 1 + Exp(IL - PL)
R = 1 / R'

A few years back, I bought a high school “advanced mathematics” book because it had all the basic stuff in one place–stuff I once memorized and even used sometimes but have forgotten over the years. Every time I get stumped on a math problem, that’s the first place I go. It’s surprising how often a high school math textbook has the answer. And when it doesn’t, it usually has a hint that I can use to quickly find the answer online.

Later: Writing this post revealed a bug in my implementation. Somewhere along the line I had coded Exp(PL - IL), which explains why my preliminary tests yesterday had me scratching my head trying to figure out why my prioritization code wasn’t working right.