Crawler versus the URLs

When you start crawling the Web on even a small scale, you quickly learn that things aren’t nearly as neat and tidy as the RFCs would have you believe. After just a few weeks of writing code to handle all the special cases and ambiguities that crop up, you’ll start to wonder how the Web manages to work at all. Nowhere is this more evident than when working with URLs.

It’s a pleasant fantasy to believe that a document on the Web can be reached through one and only one URL. That is, our training as programmers pushes us into the belief that the URL http://www.example.com/docs/resume.html is the way to reference that particular document. It might be the preferred way, but it’s certainly not the only way. On most servers, for example, you can drop the “www”, so that http://example.com/docs/resume.html will get you to the same place. We call this “the www problem.”
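Here and throughout this post I'll illustrate with small Python sketches; they're illustrations of the idea, not our crawler's actual code. A canonicalizer's first pass usually includes host normalization, something like the following. Note the big assumption baked in: that example.com and www.example.com really do serve the same content. A careful crawler verifies that per site before collapsing the two.

    from urllib.parse import urlsplit, urlunsplit

    def normalize_host(url: str) -> str:
        # Lowercase the scheme and host, and strip a leading "www." label.
        # ASSUMPTION: host and www.host serve the same content; verify
        # per site before relying on this.
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if host.startswith("www."):
            host = host[len("www."):]
        netloc = host if parts.port is None else f"{host}:{parts.port}"
        return urlunsplit((parts.scheme.lower(), netloc, parts.path,
                           parts.query, parts.fragment))

    print(normalize_host("http://www.example.com/docs/resume.html"))
    # -> http://example.com/docs/resume.html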

That’s just the simplest example. Did you know that, on most servers, repeated slashes are irrelevant? That is, http://www.example.com/////docs////resume.html will go to the same place as the two URLs above. You can also do some path navigation within the URL, so that http://www.example.com/docs/../docs/resume.html goes to the same place as all the other examples I’ve shown.
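Both of those aliases yield to syntax-based normalization, sketched below. Note the asymmetry, though: removing the "." and ".." segments is blessed by RFC 3986 (section 5.2.4), while collapsing runs of slashes is a heuristic that most servers, but not all, happen to agree with.

    import re
    from urllib.parse import urlsplit, urlunsplit

    def normalize_path(url: str) -> str:
        parts = urlsplit(url)
        # Heuristic: collapse runs of slashes (not guaranteed by the RFC).
        path = re.sub(r"/{2,}", "/", parts.path)
        # Spec-backed: remove "." and ".." segments (RFC 3986, 5.2.4).
        segments = []
        for seg in path.split("/"):
            if seg == ".":
                continue
            if seg == "..":
                if len(segments) > 1:
                    segments.pop()
            else:
                segments.append(seg)
        return urlunsplit((parts.scheme, parts.netloc, "/".join(segments),
                           parts.query, parts.fragment))

    print(normalize_path("http://www.example.com/////docs////resume.html"))
    print(normalize_path("http://www.example.com/docs/../docs/resume.html"))
    # both print: http://www.example.com/docs/resume.html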

You can also “escape” any character within a URL. For example, you can replace a slash (/) with the character string %2F, turning the original URL above into this: http://www.example.com%2Fdocs%2Fresume.html. Strictly speaking, RFC 3986 says that escaping a reserved character like the slash produces a different URL, but plenty of lenient servers decode it and serve the same document anyway. Most often, escaping is used to encode embedded spaces and other characters that have special meanings in URLs. Sometimes escaping happens automatically, as when a user copies a link from a browser and pastes it into an HTML authoring program.
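The escape problem, at least, has a spec-backed fix: decode only the escapes for unreserved characters (letters, digits, and "-._~"), leave everything else encoded, and uppercase the remaining hex digits. A sketch:

    import re

    UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "abcdefghijklmnopqrstuvwxyz"
                  "0123456789-._~")

    def normalize_escapes(url: str) -> str:
        # Decode %XX only when it encodes an unreserved character
        # (RFC 3986, sections 2.3 and 6.2.2.2); uppercase the rest.
        def fix(match):
            ch = chr(int(match.group(1), 16))
            return ch if ch in UNRESERVED else "%" + match.group(1).upper()
        return re.sub(r"%([0-9A-Fa-f]{2})", fix, url)

    print(normalize_escapes("http://www.example.com/docs/r%65sume.html"))
    # -> http://www.example.com/docs/resume.html ("%65" is just "e")
    print(normalize_escapes("http://www.example.com/a%2fb"))
    # -> http://www.example.com/a%2Fb ("%2F" is reserved, so it stays)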

Above are just some of the simplest examples. I haven’t even started on query strings: parameters that you can pass after the path part of a URL. But even without query strings, the number of different ways you can address a particular document on the Web is essentially infinite. And yet a crawler is expected to, as much as possible, determine the “canonical” form of a URL and crawl only that. Crawling the same document multiple times wastes bandwidth (for both the crawler and the crawlee) and results in duplicate data that can only cause more problems for the processes that come along after the crawler has stored the page.
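To make “crawl only that” concrete: the sketches above compose into a canonicalizer, and the crawl frontier keys its already-seen check on the canonical form. Again, this is a sketch, and the heuristic steps (www-stripping, slash-collapsing) are inherited from the helpers above.

    def canonicalize(url: str) -> str:
        # Compose the earlier sketches, then drop the fragment, which
        # never reaches the server anyway.
        url = normalize_escapes(normalize_path(normalize_host(url)))
        return url.split("#", 1)[0]

    seen = set()

    def should_crawl(url: str) -> bool:
        canonical = canonicalize(url)
        if canonical in seen:
            return False
        seen.add(canonical)
        return True

    # All four aliases collapse to one crawl:
    for u in ("http://www.example.com/docs/resume.html",
              "http://example.com/docs/resume.html",
              "http://www.example.com/////docs////resume.html",
              "http://www.example.com/docs/../docs/resume.html"):
        print(should_crawl(u))   # True, then False, False, False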

If you haven’t written a crawler, you might think I’m just contriving examples. I’m not. The www problem in particular is a very real issue that, if not addressed, can cause a crawler to read a very large number of pages twice: once with the www and once without. The other issues are not nearly as prevalent, but they are significant, so significant that every crawler author spends a huge amount of time developing heuristics for URL canonicalization. Simply following the specification in RFC 3986 will get you most of the way there, but some ambiguities simply cannot be resolved by syntax alone. So we do the best we can.

You might also wonder where these weird URLs come from. The answer is, “everywhere.” Scripts are high on the list of culprits. They can mangle URLs beyond belief. For example, one script I encountered had the annoying feature of re-escaping a parameter in the query string. The percent sign (%) is one of those characters that gets escaped because it has special meaning in URLs.

So imagine a script reached from the URL http://www.example.com/script.php?page=1&username=Jim%20Mischel. When it generates the page, the script appends the username variable to the query string of every link, escaping the value again even though it is already escaped. So links harvested from the page have this form: http://www.example.com/script.php?page=2&username=Jim%2520Mischel. “%25” is the escape code for the percent sign. Now imagine following a chain of 10 links, all generated by that script. You end up with http://www.example.com/script.php?page=10&username=Jim%2525252525252525252520Mischel.

What’s a poor crawler to do?

We do the best we can, and we have measures in place to identify such situations so that we can improve our canonicalization code. But it’s a never-ending battle. Whenever we think we’ve seen it all, we run into another surprise.
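For the runaway escaping above, for instance, one such measure is to unescape repeatedly until the value stops changing, capped so a pathological input can’t spin us forever. It’s only a heuristic, because %2520 can legitimately mean a literal “%20”, so the result is a candidate to verify, not the truth.

    from urllib.parse import unquote

    def collapse_repeated_escapes(value: str, max_rounds: int = 20) -> str:
        # Heuristic: unescape until the value stops changing. Dangerous
        # in general ("%2520" may really mean a literal "%20"), so treat
        # the result as a candidate to verify, not as the truth.
        for _ in range(max_rounds):
            decoded = unquote(value)
            if decoded == value:
                break
            value = decoded
        return value

    mangled = "Jim%2525252525252525252520Mischel"
    print(collapse_repeated_escapes(mangled))   # -> Jim Mischel

The cap matters, too: the whole point of a heuristic like this is to bound the damage a broken (or hostile) script can do.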