More URL filtering

Last week I mentioned proxies and other URL filtering issues that we’ve encountered when crawling the Web. A problem that continually plagues us is repeated path components, with URLs like these:

http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3
http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3

I don’t know why some sites do that, but a crawler can easily get caught in a trap and will generate such URLs indefinitely, or at least until our self-imposed URL length limit kicks in. Most of the time when that happens, we discover that all the URLs resolve to the same file, and removing the repeated path component (i.e. creating http://www.example.com/mp3/song.mp3) is the right thing to do.

A single repeated component is by far the most common case, but we frequently see groups of two or three components repeated:

http://www.example.com/mp3/download/mp3/download/mp3/download/song.mp3
http://www.example.com/mp3/Rush/download/mp3/Rush/download/song.mp3

It’s easy enough to write regular expressions that identify the repeated path components, and replacing the repeats with a single copy is trivial. But it’s not a good general solution. For example, this blog (and many others) uses URLs of the form blog.mischel.com/yyyy/mm/dd/post-name/, so the entry for July 7 is blog.mischel.com/2008/07/07/post-name/. Globally applying the repeated component removal rules would break a very large number of URLs.
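
To make that concrete, here’s a rough sketch of the collapse-the-repeats rule in Python (purely for illustration; it’s not what our crawler actually runs), along with the way it misfires on date-based URLs like this blog’s:

```python
import re

# Collapse a run of one, two, or three path components that immediately
# repeats (e.g. /mp3/mp3/mp3/ or /mp3/download/mp3/download/) down to a
# single copy. The non-greedy {1,3}? makes the pattern prefer the
# shortest repeating unit. This is the naive global rule, not a real filter.
REPEAT = re.compile(r'((?:/[^/]+){1,3}?)\1+')

def collapse_repeats(url):
    return REPEAT.sub(r'\1', url)

# The cases it fixes:
print(collapse_repeats('http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3'))
# -> http://www.example.com/mp3/song.mp3
print(collapse_repeats('http://www.example.com/mp3/Rush/download/mp3/Rush/download/song.mp3'))
# -> http://www.example.com/mp3/Rush/download/song.mp3

# The case it breaks:
print(collapse_repeats('http://blog.mischel.com/2008/07/07/post-name/'))
# -> http://blog.mischel.com/2008/07/post-name/ (a legitimate URL, mangled)
```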

This is one of the many URL filtering problems for which there is no good global solution. Sometimes repeated path components are legitimate. We can use heuristics based on the crawl history (e.g. if /mp3/song.mp3 generates /mp3/mp3/song.mp3, the site is probably a trap) to identify problem sites, but in the end we have to write domain-specific filtering rules. Manually identifying and coding around the dozen or so worst offenders makes a big dent in the problem.
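
For the curious, the crawl-history heuristic looks roughly like this. It’s only a sketch: the seen_paths and suspect_hosts bookkeeping is invented for the example, and a real check would also confirm that the two URLs serve identical content before flagging the host.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Same repeat pattern as in the earlier sketch.
REPEAT = re.compile(r'((?:/[^/]+){1,3}?)\1+')

# Hypothetical bookkeeping: paths already crawled, per host, and hosts
# flagged as probable repeat-component traps.
seen_paths = defaultdict(set)
suspect_hosts = set()

def record_and_check(url):
    """Flag the host when a newly discovered URL is just an
    already-crawled path with a component (or group) repeated."""
    parts = urlsplit(url)
    host, path = parts.netloc, parts.path
    collapsed = REPEAT.sub(r'\1', path)
    if collapsed != path and collapsed in seen_paths[host]:
        suspect_hosts.add(host)   # candidate for a domain-specific rule
    seen_paths[host].add(path)

record_and_check('http://www.example.com/mp3/song.mp3')
record_and_check('http://www.example.com/mp3/mp3/song.mp3')
print(suspect_hosts)              # {'www.example.com'}
```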

Another per-domain problem is that of session IDs encoded within the path, or passed with uncommon parameter names. For example, we can easily identify and remove common IDs like PHPSESSID= and sessionid=, but these URLs will escape the filter unscathed:

http://www.example.com/file.html?exSession=123456xyzzy
http://www.example.com/file.html?exSession=845038plugh
http://www.example.com/coolstuff/123456xyzzy/index.html
http://www.example.com/coolstuff/845038plugh/index.html

It’s easy for humans to look at the first two URLs and determine that they likely go to the same place.  Same for the second pair.  The computer isn’t quite that smart, though, and making it that smart is very difficult.
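
The query-string half of the problem is at least mechanical once you know the parameter names. A sketch of that kind of filter is below; the parameter set is the part that needs per-domain maintenance, and it does nothing at all for IDs buried in the path like the last two URLs above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Session parameters we know how to strip. exSession isn't in the set,
# so the example URLs above sail right through, which is exactly the problem.
SESSION_PARAMS = {'phpsessid', 'sessionid'}

def strip_session_params(url):
    """Drop known session-ID parameters from the query string."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_session_params('http://www.example.com/file.html?PHPSESSID=123456xyzzy&page=2'))
# -> http://www.example.com/file.html?page=2
print(strip_session_params('http://www.example.com/file.html?exSession=123456xyzzy'))
# -> unchanged: the filter has no idea exSession is a session ID
```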

Developing a system that automatically identifies problem URLs and generates filtering rules is a “big-R” research project, something that we don’t have time to work on at the moment. Even if we were to develop such a thing, it’d be pretty fragile and would require constant monitoring and tweaking. If a site’s URL format changes (something that happens with distressing frequency), the filtering rules become invalid. Usually the effect will be letting through some stuff that should have been filtered, but in rare cases a change in the input data can lead to the filter rejecting a large number of URLs that it should have passed.

When I started this project, I knew that crawling the Web was non-trivial. But it turns out that the URL filtering problem is much more complex than I expected the entire Web crawler to be.