Category Archives: Web Crawling

Hey, you deleted my files!

We got a rather strongly worded message the other day from a Webmaster who was threatening legal action because our crawler deleted a bunch of files from his site.  The news that our crawler is capable of deleting files was … Continue reading

Posted in Idiocy, Web Crawling | 3 Comments

More URL Filtering

Last week I mentioned proxies and other URL filtering issues that we’ve encountered when crawling the Web.  A problem that continually plagues us is repeated path components–URLs like these:

http://www.example.com/mp3/mp3/mp3/mp3/mp3/song.mp3
http://www.example.com/mp3/mp3/mp3/mp3/mp3/mp3/song.mp3

I don’t know why some sites do that, but … Continue reading
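To make the repeated-path-component problem concrete, here is a minimal Python sketch of one way a crawler might flag such URLs. It is not the filter described in the full post; the function names and the threshold of 3 are assumptions chosen purely for illustration.

```python
from urllib.parse import urlsplit


def max_consecutive_repeats(url: str) -> int:
    """Return the length of the longest run of identical adjacent path segments."""
    # Split the path on "/" and drop empty segments (leading or doubled slashes).
    segments = [s for s in urlsplit(url).path.split("/") if s]
    longest = 0
    run = 0
    previous = None
    for segment in segments:
        # Extend the current run if this segment repeats the previous one.
        run = run + 1 if segment == previous else 1
        longest = max(longest, run)
        previous = segment
    return longest


def looks_like_path_loop(url: str, threshold: int = 3) -> bool:
    """Hypothetical heuristic: treat `threshold` or more repeats as suspicious."""
    return max_consecutive_repeats(url) >= threshold
```

A check like this only catches a segment repeated back to back. Longer repeating patterns (such as /a/b/a/b/) would need a different test, and any threshold will occasionally reject a legitimate URL.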

Posted in Web Crawling | Comments Off

Proxy fits

Three years ago I mentioned anonymous proxies as a way to “anonymize” your Internet access. At the time I neglected to mention one of their primary uses: allowing you to surf sites that might be blocked by your friendly IT … Continue reading

Posted in Web Crawling | 1 Comment

Crawler versus the URLs

When you start crawling the Web on even a small scale, you quickly learn that things aren’t nearly as neat and tidy as the RFCs would have you believe.  After just a few weeks of writing code to handle all … Continue reading

Posted in Web Crawling | 1 Comment

Major search engines support robots.txt standard

Google, Yahoo, and Microsoft’s Live Search recently announced standard support for the major robots.txt directives.  This means that you can use the same syntax for robots.txt to control the activities of those three major search engine crawlers.  The common directives … Continue reading

Posted in Web Crawling | Comments Off

One more time: the Internet is public

[Note:  As Michael Covington pointed out, there's plenty of privacy on the Internet--just not on the World Wide Web.] I know I’ve mentioned this before, but I keep running across people who don’t understand that there is no privacy on … Continue reading

Posted in Internet, Web Crawling | 2 Comments

Webbots, Spiders, and Screen Scrapers

Considering what I’m doing for work, you can imagine that when I ran across Michael Schrenk’s Webbots, Spiders, and Screen Scrapers recently, I ordered a copy. The book is a tutorial on writing small Web bots that automate the collection … Continue reading

Posted in Book Reviews, Programming, Web Crawling | Comments Off

Reducing bandwidth used by crawlers

Some site operators block web crawlers because they’re concerned that the crawlers will use too much of the site’s allocated bandwidth. What they don’t realize is that most companies that operate large-scale crawlers are much more concerned with bandwidth usage … Continue reading

Posted in Web Crawling | Comments Off

A variation on the homegrown DOS attack

Tuesday, in How to DOS yourself, I described how to erroneously configure an Apache server and cause what appears to be a denial of service attack. There’s another way to do it that is even more insidious. In Tuesday’s post … Continue reading

Posted in Web Crawling | Comments Off

How to DOS yourself

It’s surprising the things you’ll learn when you write a Web crawler. Today’s lesson: how to be both perpetrator and victim of your own denial of service attack. Not everybody likes crawlers accessing their sites. Most will modify their robots.txt … Continue reading

Posted in Web Crawling | 1 Comment

Opt in or opt out?

I mentioned before that there is a small but very vocal group of webmasters who say that crawlers should stay off their sites unless specifically invited. It is their opinion that they shouldn’t have to include a robots.txt file in … Continue reading

Posted in Web Crawling | 2 Comments

Why every site should have a robots.txt file

People often ask if they need a robots.txt file on their sites. I’ve seen some Web site tutorials that say, in effect, “don’t post a robots.txt file unless you really need it.” I think that is bad advice. In my … Continue reading

Posted in Web Crawling | Comments Off

More On Robots Exclusion

As I mentioned yesterday, the Robots Exclusion Standard is a very simple protocol that lets webmasters tell well-behaved crawlers how to access their sites. But the “standard” isn’t as well defined as some would have you think, and there’s plenty … Continue reading
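As a rough illustration of what "telling well-behaved crawlers how to access a site" looks like in practice, here is a small Python sketch that uses the standard library's `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `is_allowed` helper are all made up for the example; the sketch covers only the basic `User-agent` and `Disallow` lines, not the extensions the post goes on to discuss.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A polite crawler would run a check like this before every request, using the robots.txt it downloaded from the site in question.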

Posted in Web Crawling | Comments Off

Struggling with the Robots Exclusion Standard

The Internet community loves standards. We must. We have so many of them. Many of those “standards” are poorly defined or, even worse, ambiguous. Or, in the case of robots.txt, subject to a large number of extensions that have become … Continue reading

Posted in Web Crawling | Comments Off

Web Search Ramblings

Most people reading this blog understand conceptually how Google and other search engines work. In brief, they have a program called a Web crawler that goes from one Web site to the next, downloading and storing pages, and extracting links … Continue reading

Posted in Computers, Web Crawling | Comments Off

You want it when?

The web crawler I’m working on, as I’ve mentioned before, is a distributed application. Currently it consists of a URL Server and multiple Crawlers. The basic idea is that the URL Server is a traffic director that tells each Crawler … Continue reading

Posted in Programming, Web Crawling | Comments Off

Bloom Filters in C#

As I’ve pointed out before, writing a Web crawler is conceptually simple: read a page, extract the links, and then go visit those links. Lather, rinse, repeat. But it gets complicated in a hurry. The first thing that comes to … Continue reading

Posted in Programming, Web Crawling | Comments Off

Multi-threaded programming

I’ve been head-down here working on the Web crawler and haven’t had much occasion to sit down and write blog entries. It’s been a very busy but interesting and rewarding time. A high performance distributed Web crawler is a rather … Continue reading

Posted in Programming, Web Crawling | 1 Comment

Crawling Along

After you get your basic web crawler downloading pages and extracting links, you find yourself having to make a decision: how do you feed the harvested URLs back into the crawler? For instance, if I visit www.mischel.com and extract a … Continue reading

Posted in Web Crawling | 3 Comments

Crawling the Web

I’m writing a Web crawler. Yeah, I know. It’s already been done. It seems like everybody’s done some Web crawling. But there’s a huge difference between dabbling at it and writing a scalable, high-performance Web crawler that can pull down … Continue reading

Posted in Web Crawling | 1 Comment